How to tune/optimize a DQC load plan?



In day-to-day project work, consultants may come across performance issues of various magnitudes in their load plans. The optimization tips below can help with tuning long-running plans.

  1. Check the memory allocation

DQC plans store data while processing, so the more input data there is, the more memory DQC requires. Memory-intensive executions can therefore fail with a memory overflow error once the default memory allocated to DQC is consumed. In such cases, increasing the memory can be the solution.

Related Errors:

  • OutOfMemoryError: Java Heap Space or GC Overhead Limit Exceeded...
  • java.lang.OutOfMemoryError: unable to create new native thread... 

The memory can be increased in several ways:

  • Increasing the memory of the DQC IDE (Desktop)
  • Increasing the memory allocated to a specific plan in the IDE
  • Increasing the memory allocated to the entire project in the IDE
  • Increasing the memory of the DQC Web Console on a server

Please visit the link below for more information about memory configuration:

How to Optimize memory?
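All of these options ultimately adjust the heap of the JVM that runs DQC (the OutOfMemoryError messages above are standard JVM errors). A heap increase typically looks like the fragment below; the values, and the file or property the flags go into, are illustrative and depend on your DQC version and deployment:

```
# Illustrative JVM heap settings - exact location varies
# (IDE launcher config, plan/project run settings, or server config)
-Xms512m   # initial heap size
-Xmx4g     # maximum heap size
```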

  2. Splitter step

The Splitter step splits the input into separate records, which is generally performance intensive. This step is used extensively in the post-processing components of DQ Monitoring projects, where you typically need to split up the multiple invalid rules and explanations that arrive concatenated in a single record. The split records are then loaded into a database table or DQIT, depending on the requirement.

One important tip is to remove unused columns with an "Alter Format" step before the "Splitter" step. This reduces the amount of data duplicated when a single record is split into multiple records and can considerably improve performance.
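The effect of trimming columns before splitting can be sketched in Python. The column names and the "|" delimiter below are illustrative assumptions, not DQC API:

```python
# Sketch: split concatenated rule values into one record per rule,
# after first dropping the columns the downstream flow does not need
# (the "Alter Format" analogue), so each duplicated record is smaller.

def split_invalid_rules(records, keep_columns, rule_column="invalid_rules"):
    """Keep only needed columns, then emit one output record per rule."""
    out = []
    for rec in records:
        # "Alter Format" analogue: drop unused columns before duplication.
        slim = {col: rec[col] for col in keep_columns}
        # "Splitter" analogue: one output record per concatenated value.
        for rule in rec[rule_column].split("|"):
            out.append({**slim, rule_column: rule})
    return out

records = [{"id": 1, "big_payload": "x" * 1000,
            "invalid_rules": "rule_a|rule_b"}]
print(split_invalid_rules(records, keep_columns=["id"]))
# -> [{'id': 1, 'invalid_rules': 'rule_a'}, {'id': 1, 'invalid_rules': 'rule_b'}]
```

Without the column trim, the 1000-character payload would be copied into every split record.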

 

  3. Join step

Joining two tables/datasets is a very common requirement in a plan and can cause performance overhead, especially when you are joining large datasets. If you select the default match strategy "MERGE", a temporary file is created in the background to store the data being processed, which can occasionally result in a disk space error.

The important tip is to use the "HASH_LEFT" or "HASH_RIGHT" match strategy when joining a large dataset with a smaller one. This significantly improves performance, as only a lookup (hash) table is created in the background from the smaller input.

Remember not to use a hash strategy when both datasets in the join are large.

Value        Description
MERGE        Generic pairing technique based on grouping the whole data from both inputs.
HASH_LEFT    Records from the left input are used for the hash table.
HASH_RIGHT   Records from the right input are used for the hash table.
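The difference between the strategies can be sketched in Python. A hash join holds only the smaller side fully in memory while the large side is streamed through it; the field names below are illustrative:

```python
# Sketch of a HASH_RIGHT-style join: the smaller (right) input is loaded
# into an in-memory hash table, and the large (left) input is streamed
# through it record by record.

def hash_right_join(left_rows, right_rows, key):
    # Build the lookup table from the *smaller* right input only.
    lookup = {row[key]: row for row in right_rows}
    for l in left_rows:              # the large input is never held in full
        r = lookup.get(l[key])
        if r is not None:
            yield {**l, **r}

countries = [{"code": "DE", "name": "Germany"},
             {"code": "FR", "name": "France"}]                  # small input
customers = [{"id": 1, "code": "DE"}, {"id": 2, "code": "FR"}]  # large in practice

print(list(hash_right_join(customers, countries, "code")))
# -> [{'id': 1, 'code': 'DE', 'name': 'Germany'},
#     {'id': 2, 'code': 'FR', 'name': 'France'}]
```

A MERGE-style strategy, by contrast, has to group and pair the whole data from both inputs, which is why it may spill to a temporary file.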

 

  4. Replacing the Row-by-Row Update with a Bulk Update

If your plan uses an "SQL Execute" step to run an update query for every row in the input flow, it may cause performance issues, especially on a large table.

In such scenarios, consider the following options to improve performance:

  • Remember to commit in smaller batches, which ends the transaction and clears the database memory.
  • Check whether the row-by-row update can be replaced with a single update query that bulk-updates the table, which takes far less time.
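The contrast can be sketched with Python's standard sqlite3 module; the table and column names are illustrative:

```python
import sqlite3

# Sketch: replacing a per-row "SQL Execute" update with one set-based
# UPDATE statement. Table and column names are illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (id INTEGER PRIMARY KEY, status TEXT)")
con.executemany("INSERT INTO scores (id, status) VALUES (?, ?)",
                [(i, "new") for i in range(1000)])

# Row-by-row (slow): one UPDATE statement per input record.
# for row_id in range(1000):
#     con.execute("UPDATE scores SET status = 'done' WHERE id = ?", (row_id,))

# Bulk (fast): one statement updates every qualifying row, then one commit.
con.execute("UPDATE scores SET status = 'done' WHERE status = 'new'")
con.commit()

print(con.execute(
    "SELECT COUNT(*) FROM scores WHERE status = 'done'").fetchone()[0])
# -> 1000
```

On a real database server the row-by-row version also pays one network round trip per row, so the gap is usually even larger than in this local sketch.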

  5. Using Group Aggregator instead of Representative Creator

If you want to perform simple aggregations, it is recommended to use the "Group Aggregator" or "Aggregating Column Assigner" step rather than the "Representative Creator" step in the plan.

The "Representative Creator" step is mainly intended for advanced grouping strategies with selection rules. Choosing the right aggregation step for your use case helps you avoid performance bottlenecks.
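A "simple aggregation" in this sense is a single streaming pass that computes values such as a count or sum per group, sketched here in Python (field names are illustrative):

```python
from collections import defaultdict

# Sketch of a Group Aggregator-style simple aggregation: count and sum
# per group, computed in one pass over the input records.
def aggregate(records, group_key, value_key):
    counts, sums = defaultdict(int), defaultdict(float)
    for rec in records:
        g = rec[group_key]
        counts[g] += 1
        sums[g] += rec[value_key]
    return {g: {"count": counts[g], "sum": sums[g]} for g in counts}

rows = [{"dept": "A", "amount": 10.0}, {"dept": "A", "amount": 5.0},
        {"dept": "B", "amount": 7.0}]
print(aggregate(rows, "dept", "amount"))
# -> {'A': {'count': 2, 'sum': 15.0}, 'B': {'count': 1, 'sum': 7.0}}
```

Selection rules (picking one "representative" record per group by priority criteria) are the heavier case that Representative Creator is designed for.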

  6. Setting up Parallelism

Parallel processing is a method of computing in which two or more processors (CPUs) handle separate parts of an overall task. Parallelism plays an important role in the performance of a plan. The number of threads (the parallelism level) is defined for the whole plan at once, by the Parallelism Level property. The default parallelism level is 1.

The parallelism setting affects all DQC steps of the filter type, whereas complex steps ignore the parallelism setting and always run in an unmodifiable single-thread configuration.
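Why filter-type steps parallelize well can be sketched in Python: each record is processed independently, so the work can be spread across worker threads. The worker count and the per-record function are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: a filter-type step processes records independently, so the
# per-record work can be spread across N worker threads.
def filter_step(record):
    # Stand-in for per-record work, e.g. cleansing one value.
    return record.strip().upper()

records = ["  alice ", "bob", "  carol  "]
with ThreadPoolExecutor(max_workers=4) as pool:   # "parallelism level" of 4
    results = list(pool.map(filter_step, records))  # order is preserved
print(results)
# -> ['ALICE', 'BOB', 'CAROL']
```

A complex step (such as a join or sort) needs to see groups of records together, which is why it cannot simply be split up this way.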

How to identify the filter steps?

  • Filter steps usually have a single input and a single output, and they process individual records.
  • Filter steps have an Advanced tab with a When Condition in their configuration (this is inherited behavior). If a step does not have a When Condition in the Advanced tab, it is not a filter step.

Configuring Parallelism

The parallelism value is set for the whole plan at once. Currently, the maximum number of threads per step is unlimited, though it is good practice not to exceed the number of CPUs in the system.

The parallelism level can be configured in multiple places within the project and is resolved according to a defined priority order.

