Solved

DQC match - slow incremental load

  • 27 February 2023
  • 6 replies
  • 82 views

Userlevel 2
Badge +2

The matching component is slow at the extended unifications step during an incremental load. When a record like ‘WALGREEN%’ is loaded and there are 20K records with a similar name on the instance, all of those records are pulled into the matching process. As a result, processing the increment is slow: even when the incremental load contains only 11 records, about 30K records on the instance table end up with their transaction_id and omni_modified_date updated to those of the work order. When I check the history, nothing changes on these records except omni_modified_date and transaction_id. Tweaking the union keys to create combinations of name and key factors, or adding separate groups for the high-count records, has not helped. Any thoughts or suggestions?


Best answer by AKislyakov 28 February 2023, 14:21


6 replies

Userlevel 4
Badge +4

Hi,

Usually, the solution is indeed to introduce additional components to the union keys to minimize the number of records fetched from the repository during the expansion phase.

  • Can you share the execution logs, preferably with an increased unification step logging level? (To increase the logging level of the unification step, set the unify.verbose=4 parameter.)
  • Also, you can check the cardinality of the resulting keys by querying the <REPO>_keys table, to verify that introducing new key factors actually reduced the cardinality (a minimal sketch follows this list).
  • Finally, what version are you using, and how many operations are there in the Extended Unification step?
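
As a rough illustration (not a definitive check), a query along these lines compares the number of distinct keys against the total number of rows per operation; it assumes the repository keys table <REPO>_keys with its op and ukey columns:

  -- a low distinct_keys relative to total_rows suggests the union keys
  -- are not selective enough and the candidate groups will be large
  SELECT op,
         COUNT(DISTINCT ukey) AS distinct_keys,
         COUNT(*)             AS total_rows
  FROM <REPO>_keys
  GROUP BY op;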
Userlevel 2
Badge +2

Appreciate your response. I did see improvement after adding a new union key.

I will check the cardinality and identify new keys.

We are using Omni 3.16 and DQS 12.6.3.

Userlevel 2
Badge +2

I tried to look into the repo keys table. When you say cardinality, do you mean the counts from a query like the following should be low?


  SELECT pk,
         op,
         LEFT([ukey], 3),
         COUNT(*)
  FROM [omnirepo].[dbo].[repos_providerpractice_keys]
  GROUP BY pk,
           op,
           LEFT([ukey], 3)
  HAVING COUNT(*) > 1

Is there a guide describing the different fields in the repo keys table?

Thank you again.

 

Userlevel 4
Badge +4

Unfortunately, there is no guide, but I find the table quite straightforward (take it with a grain of salt, I’m not the developer); see the example query after the list:

  • pk -- primary key of the record.
  • op -- operation number, in case you have multiple unify operations inside a single step.
  • ukey -- unification key: all parts of the union key combined into a single string.
  • gid -- the resulting Candidate Id of the record.
  • center -- flag indicating that the record was selected as the center of a group.
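
For example, to see the keys and the resulting candidate group generated for a single record (the pk value 123 is just a placeholder):

  -- all unification keys and the candidate group for one record
  SELECT pk, op, ukey, gid, center
  FROM <REPO>_keys
  WHERE pk = 123;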

When I said cardinality, I meant something like the two queries below.

The first one shows how many records are in your largest candidate groups, i.e., how many records might be affected if a new record arrives. The second one shows the most frequent keys, which is what you usually have to change to make the candidate groups smaller.

  SELECT gid, COUNT(*) AS group_rec_count
  FROM <REPO>_keys
  GROUP BY gid
  ORDER BY 2 DESC

  SELECT ukey, COUNT(*) AS key_rec_count
  FROM <REPO>_keys
  GROUP BY ukey
  ORDER BY 2 DESC
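
For the ‘WALGREEN%’ case specifically, a variation like this can show which key values that group collapses into (this assumes the name is the leading component of ukey, which depends on how your union key is composed):

  -- how many records share each key value for the problematic name
  SELECT ukey, COUNT(*) AS key_rec_count
  FROM <REPO>_keys
  WHERE ukey LIKE 'WALGREEN%'
  GROUP BY ukey
  ORDER BY 2 DESC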
Userlevel 2
Badge +2

Thank you for the detailed explanation, it really helps.

I added a union key, which helped with the performance. Since it was based on a new field, I had to adjust the matching rules accordingly. I am currently testing.

 

Userlevel 2
Badge +2

Thanks a ton, this was helpful. Separately, I am still trying to resolve a bundle deployment issue, which is a recurring problem. I will create a different topic to discuss it.
