Best Practice

Setup/Tuning Kafka streaming in MDM

3 years ago
February 14, 2022
0 replies
281 views

konstantinos.patikas
Ataccamer

Introduction

We did an internal performance PoC for Streaming. We are reading messages (“msg”) from Kafka or Java messaging service.

Parameters

In order to tune the performance of the streaming we need to modify 2 main parameters on MDM side:

settings.count = how many messages we need to consume in one single batch
settings.timeout_s = what is the maximum timeout to wait for this number of messages

Average duration of the task vs Records per second

We are not aiming for the throughput (records per second) but we are aiming for the average duration of the task. The goal is to run it under 1 sec (1000ms). In various scenarios we have to increase the settings.count parameter starting from 1, 20, 30 etc.

Scenarios

Below are the different scenarios of using different settings for various tasks/transactions in MDM within 10 mins (600sec)

For example of timeout of 1 sec and settings.count = 1000 (scenario , that means that we try to consume 1000 msg at once within this 1 sec. We will consume whatever we can within the 1sec. What ever is triggered first within the 1 sec it would be consumed and then the processing on the MDM side will be triggered. The 1 sec is the maximum time to wait when fewer messages are available.

Number of processed records

Number of processed records might be larger of the number of input messages. For example 1 msg in the input can produce many records on the MDM side (for example 3 records, party-contact-address). Many records can be consumed on MDM from 1 msg.

Redundant messages

For instance, we have 10 messages from Kafka and maybe 2 or 3 of them would be really consumed (it’s like a huge overhead, we can refer to it as change detection at the msg level) the engine would deduplicate them, virtually increasing the throughput with less work on MDM side for the initial 10 messages.

High settings.count values

The bigger the volume we have on the settings.count (scenario B-E), the better the records per second we get. This is not always the case, example the scenarios below:

settings.count = 15k recs_per_sec_norm = 314 (scenario C)
settings.count = 10k recs_per_sec_norm = 410 (scenario D)

Low settings.count values

For lower values on settings.count (scenarios A & F-J), we beat the average duration of the task with less than 1 sec.

Other factors

Always another factor of the performance is the solution itself (complexity of the model/matching rules etc). In other case, for similar exercise after tuning we end up using settings.count = 5k (initially was 10k) and settings.timeout_s = 60.

Parallel settings in Streaming

Delta parallel threshold and delta parallel settings (nme.delta.*) are not applied in the context of streaming.

Whenever we have streaming is not possible to run the delta detection in parallel.

In streaming, we can not run change detection in parallel. The change detection parameters are not applicable in the context of streaming (they are only in the context of batch loads - full/delta/init). If you apply the change detection parallelism settings they do not get triggered in the context of streaming.

We could theoretically have multiple stream consumers (Kafka or JMS) each one of them reading from different sources and we might want to have different settings (message consumption count, different timeouts) which could also benefit from different parallelism settings.

Only nme.consolidation.* and nme.commit.* settings are applied in the context of Streaming. Those settings are global and we can't define different parallelism settings for each consumer.

The load process is slow with enabled parallelism for very few records

In general, with enabled parallelism, the fewer records in the transaction we have, the slower the process is.

If you enable the parallelism, for mdc the master data consolidation part (cleansing, matching, merging) and for the DB the committing part, it is slower with more parallelism than if it runs with a single thread if you have let’s say less than 500 records across all the entities in 1 transaction. So the threshold makes sense.

It is slower because new threads are created in memory, they have to do the job and then they need to get synchronized. However with 1 thread, it’s already created and doesn’t need to sync that much with others.

With DB committing phase: for let’s say 50 records that we have to commit, with 1 thread, it just reads the data and pushes them to the DB and commits in 1 session. For each thread you have to create a session in the DB so there is some overhead in terms of session creation, then each thread needs to read the data from the temp, insert and commit and then they need to sync with the main control thread.

If we have 1000 records the parallelism would have some effect, with 100 records it just slows it down.

Conclusion

Similar to .perf files we have for executing plans we could have something similar for stream consumers, each stream consumer could be different and we might want to tune it differently. We would have to support different parallel settings for each consumer.

Did this topic help you find an answer to your question?

Reply

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

Introduction

Parameters

Average duration of the task vs Records per second

Scenarios

Number of processed records

Redundant messages

High settings.count values

Low settings.count values

Other factors

Parallel settings in Streaming

The load process is slow with enabled parallelism for very few records

Conclusion

Reply

Related topics

Ataccama IDE - Postman Api tool integration tutorial?icon

Working with Databases 💿

Data Quality and Pipeline Execution

Working with delta changes in Ataccama ONE

API Usage with Ataccamaicon

Most Liked this week

Sign up

Login to the Ataccama Community

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings