
I have been exploring the Reprocess and Rematch options in MDM.
From this post I understand that Reprocess is used to reprocess records. For example, let's say that during the initial load the email values were not modified in contact_clean.comp, and we now need to find the emails with the domain company-name.com and update them to the domain companyname.com. So, I need to run the reprocess job.
Rematch runs the match step on top of the reprocess, since the email values have been modified. What I do not understand is this: do I run a rematch job only if the email value participates in determining the master record, or do I need to run it whenever a reprocess job is run, irrespective of the changes the reprocess made?
I have another question. When we run a load job, change detection is performed, right? Does this change detection run on all the records or just the delta records? After change detection, are matching and merging run only on the records that changed, or on all the records?

I am asking because I scheduled a workflow to load records and export them, assuming that whenever new records are added or old records are modified, the master records will be updated (when one of the related instance records changes) thanks to change detection. I did not change any code.
But a new record was added and matched to an already existing master record, and the match_rule recorded on the master record did not justify the match. I ran a reprocess and rematch job, and the new record became a separate master record.

I am looking for an explanation of this, and also a suggestion on whether I need to run the rematch job every time I load records and export them for downstream sources.

Hi ​@aish_TF 

This is a good example of how the different processes in an MDM load fit together.

When you perform any load, the system will first perform change detection on all the rows that are supplied. So, if it's a full load then all rows in the data set are checked; if it's a delta load then only the rows in the delta set are checked.

Only those rows that have changed since the previous load are then processed further through the transformation layers.
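As a rough illustration of the change detection described above, here is a minimal Python sketch. It is not the MDM engine's implementation: the fingerprinting approach, the function names and the sample rows are all assumptions made for the example. It only shows that the rows supplied to a load are the ones checked, and that only new or modified rows move on.

```python
import hashlib


def row_fingerprint(row: dict) -> str:
    """Stable hash of a row's values (keys sorted so the result is deterministic)."""
    payload = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def detect_changes(supplied: dict, previous_fingerprints: dict) -> list:
    """Return keys of supplied rows that are new or differ from the previous load.

    Only the supplied rows are inspected, so a delta load checks just the delta.
    """
    return [
        key
        for key, row in supplied.items()
        if previous_fingerprints.get(key) != row_fingerprint(row)
    ]


# Hypothetical state left behind by the previous load.
previous = {
    "A-1": {"name": "Ann Lee", "email": "ann@companyname.com"},
    "B-1": {"name": "Bob Ray", "email": "bob@companyname.com"},
}
previous_fp = {key: row_fingerprint(row) for key, row in previous.items()}

# Full load: every row is supplied again, plus one new row.
full_load = dict(previous)
full_load["A-2"] = {"name": "Cy Dunn", "email": "cy@companyname.com"}
changed_in_full_load = detect_changes(full_load, previous_fp)

# Delta load: only one (modified) row is supplied.
delta_load = {"B-1": {"name": "Bob Ray", "email": "robert@companyname.com"}}
changed_in_delta_load = detect_changes(delta_load, previous_fp)
```

In both cases the unchanged rows drop out, so only `A-2` (full load) and `B-1` (delta load) would carry on through the transformation layers.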

Reprocessing plans are effectively a special type of load, where all the rows previously loaded to the instance layer (or a specified subset in the case of a partial reprocess) are treated as "changed" for the purposes of carrying on the transformation process. This is particularly useful for cases like yours where the source data is static but enrichment rules in the cleanse plans have changed.

The only rows in the instance layer that will be updated during a load are those that are in this "changed" state.

Reprocessing can be carried out with or without rematching. Any reprocess will perform the cleanse transformation. If the rematch option is selected then it will additionally run the match plan to update a small group of columns including the master_id and the match_rule. 

If data columns are updated in the instance layer during a reprocess cleanse, or the row is rematched, then the merge operation will be carried out to propagate these changes to the master layer. If rematch is selected then the master grouping will be re-evaluated first; otherwise the existing master group will be retained. Either way, the merge plan will be executed for all instances in the master group.
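The stages described in this reply can be summarised in a small Python sketch. The stage names and the `job_stages` function are made up for illustration and are not MDM identifiers; the sketch only encodes the order of steps as described here.

```python
def job_stages(job: str, rematch: bool = False) -> list:
    """List the stages a job runs, following the description in this thread.

    Stage names are illustrative labels, not real MDM plan names.
    """
    if job == "load":
        # Only rows flagged by change detection continue past the first stage.
        return ["change_detection", "cleanse", "match", "merge"]
    if job == "reprocess":
        # All previously loaded rows (or the chosen subset) are treated as changed.
        stages = ["treat_rows_as_changed", "cleanse"]
        if rematch:
            # Re-evaluates the master grouping (master_id, match_rule, ...).
            stages.append("match")
        # Merge runs when cleansed data changed or the row was rematched.
        stages.append("merge")
        return stages
    raise ValueError(f"unknown job type: {job}")


plain_reprocess = job_stages("reprocess")
reprocess_with_rematch = job_stages("reprocess", rematch=True)
```

The only difference between the two reprocess variants is the extra match stage, which is what allows the master grouping to change.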

If your email address value is not used as part of your matching rules then you do not need to run the reprocess with rematch: it will be evaluated as part of the merge plan and written to the master layer if it meets the survivorship criteria.

If the email address is used as part of the match rules then you need either to run a reprocess with rematch or (more likely) add the column to the "rematch if changed" section in the matching tab's Advanced Matching Configuration.
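The decision in the last two paragraphs comes down to a set intersection. The Python sketch below is only a way to reason about it; the function and the column names are hypothetical and not part of any MDM API.

```python
def needs_rematch(changed_columns, match_rule_columns) -> bool:
    """True when at least one column changed by the cleanse is used by the match rules."""
    return bool(set(changed_columns) & set(match_rule_columns))


# Hypothetical column names, loosely based on the example in this thread.
email_used_in_matching = needs_rematch({"email"}, {"name", "email", "id"})
email_not_used_in_matching = needs_rematch({"email"}, {"name", "id"})
```

Here `email_used_in_matching` is `True` (rematch, or list the column under "rematch if changed"), while `email_not_used_in_matching` is `False` (a plain reprocess is enough and the merge plan picks up the new value).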

I think what is happening in your process is a result of the way rows are made available for processing during a load versus a reprocess with rematch.

Let's assume you perform a full load where just one new row is added and every other row is unchanged.  Only that single row will be available to be updated. The engine will attempt to match it against the data already loaded. 

When you run a reprocess with rematch, the entire data population is re-opened for matching, and this can lead to different groupings being selected.

The possibility of match groups changing means you need to decide the relative merits for running reprocess with rematch across your whole data set. Generally changes will be limited but it does depend on how your match rules are defined. 
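To make the difference concrete, here is a toy Python model. It is an assumption-heavy sketch, not how the MDM engine matches: rows are plain integers, the "fuzzy" rule says two values match when they differ by at most 1, and a full rematch groups rows that are linked by any chain of matches.

```python
from itertools import combinations


def is_match(a: int, b: int) -> bool:
    """Toy fuzzy rule (an assumption): values match when they differ by at most 1."""
    return abs(a - b) <= 1


def incremental_match(groups: list, new_value: int) -> list:
    """Attach the new row to the first group it matches; other groups stay untouched."""
    result = [list(group) for group in groups]
    for group in result:
        if any(is_match(new_value, value) for value in group):
            group.append(new_value)
            return result
    result.append([new_value])
    return result


def full_rematch(values: list) -> list:
    """Regroup every row from scratch: rows linked by a chain of matches share a group."""
    parent = {value: value for value in values}

    def find(value):
        while parent[value] != value:
            value = parent[value]
        return value

    for a, b in combinations(values, 2):
        if is_match(a, b):
            parent[find(a)] = find(b)

    groups = {}
    for value in values:
        groups.setdefault(find(value), []).append(value)
    return sorted(groups.values())


existing_groups = [[1], [3]]                         # 1 and 3 do not match each other
after_load = incremental_match(existing_groups, 2)   # only the new row is open
after_rematch = full_rematch([1, 3, 2])              # every row is open
```

In this toy model the load leaves two groups (the new row joins the group containing 1, and the group containing 3 is never revisited), whereas the full rematch produces a single group because 2 bridges 1 and 3. With strictly exact rules this kind of bridging is much less likely.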

I've never seen MDM match a row incorrectly but the logic for the matching can be confusing: we'd need to inspect the rules to work out why this is.  For instance, a common misconception is that match rules are executed "in order" - so one match rule should take precedence over another one defined lower in the list. This is not the case, as it wouldn't work with the threading model. This means you can't consider a single rule in isolation: you have to consider all defined rules as a set and make sure each is fully deterministic.
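A small Python sketch of what "consider all defined rules as a set" means in practice. The rule names and fields are hypothetical, and this says nothing about how the engine chooses the value stored in match_rule; it only shows that when every rule is evaluated, the set of rules that fire for a pair is the same whatever order they are listed in.

```python
from itertools import permutations

# Hypothetical rules, keyed by name; each compares two records.
RULES = {
    "same_phone": lambda a, b: a["phone"] == b["phone"],
    "same_tax_id": lambda a, b: a["tax_id"] == b["tax_id"],
}


def firing_rules(a: dict, b: dict, rule_order) -> frozenset:
    """Evaluate every rule (no short-circuiting) and return the names that fire."""
    return frozenset(name for name in rule_order if RULES[name](a, b))


rec_a = {"phone": "555-0100", "tax_id": "T-1"}
rec_b = {"phone": "555-0100", "tax_id": "T-1"}

# Try every possible listing order of the rules.
outcomes = {firing_rules(rec_a, rec_b, order) for order in permutations(RULES)}
order_independent = len(outcomes) == 1
overlapping_rules = sorted(next(iter(outcomes)))
```

When `overlapping_rules` contains more than one name, as it does here, the pair satisfies several rules at once, so the position of a rule in the list cannot be used to predict which one is reported.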

Hope that helps! 


Thanks ​@Phil Holbrook for the detailed and well explained response.
I think I have also been a believer in the misconception that match rules are executed in order.

I do get what you explained about reprocess and rematch. 
I have also never experienced MDM matching wrongly.
Let me try to explain with an example, what I have witnessed here and probably we can figure out what needs to be changed at the configuration level from my end.
There are two source systems, A and B.
Two or more records become part of the same matching group if they have the same name, and there are two match rules:
1. same_email_id - when a record in system A has the same email as another record in system B
2. same_id - when a record in system A has the same id as another record in system B

As you mentioned, it shouldn't matter what the order of the rules is; still, I'd like to mention that the rules above are defined in the order shown.

In the first load, one record each from system A and B have the same name, same email and same id. So, they match, merge and make a master record (let's say with master record id 1) whose match rule name is that of the 2nd match rule.
Now, in the next full load, there is a new record added in system A that has the same name as the master record(id1) created. There is no change in the original two records that participated in formation of master record id 1. I do not run reprocess or rematch steps (there is no change in transformation logic for email as well). The new record becomes a part of the same master record(id 1) and the match rule name is of 2nd match rule. 

Anyway I run a rematch, because I do not expect the newly added record to be a part of this master record. Post that, it becomes an independent master record and the other master record is as it was before the new record was loaded.
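One possible reading of the two rules and the scenario above, written as a Python sketch. It is not the actual match plan: the field names and sample values are invented, and it assumes the new record shares only the name with the existing pair and that both rules compare records across systems A and B only.

```python
def same_email_id(a: dict, b: dict) -> bool:
    """Same name, different source systems, same email."""
    return a["name"] == b["name"] and a["system"] != b["system"] and a["email"] == b["email"]


def same_id(a: dict, b: dict) -> bool:
    """Same name, different source systems, same id."""
    return a["name"] == b["name"] and a["system"] != b["system"] and a["id"] == b["id"]


RULES = {"same_email_id": same_email_id, "same_id": same_id}


def rules_that_fire(a: dict, b: dict) -> list:
    """Names of all rules satisfied by the pair, evaluated as a set."""
    return sorted(name for name, rule in RULES.items() if rule(a, b))


# Invented sample data for the scenario.
a1 = {"system": "A", "name": "Ann Lee", "email": "ann@companyname.com", "id": "100"}
b1 = {"system": "B", "name": "Ann Lee", "email": "ann@companyname.com", "id": "100"}
a2 = {"system": "A", "name": "Ann Lee", "email": "ann.lee@example.org", "id": "200"}

first_load = rules_that_fire(a1, b1)   # the original pair
new_vs_a1 = rules_that_fire(a2, a1)    # new record against system A record
new_vs_b1 = rules_that_fire(a2, b1)    # new record against system B record
```

Under these assumptions both rules fire for the original pair and neither fires for the new record, which is the outcome I expected and the one the rematch produced.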

For sure I am lacking some understanding and would like to clear my concepts here.
Also, I had scheduled a job to load the records (which would run the match on any new records, either making each part of an already existing master or creating a new master when it does not satisfy any match rule). But with this scenario I am not sure whether I should also trigger a rematch job after the load in the scheduled workflow.

Please let me know if you need any more information.

Thank you


I can’t work out what’s happening from your description ​@aish_TF. Could you upload your match plan and some dummy data?

It shouldn’t be necessary to schedule a rematch after a load unless new incoming data is likely to change the pre-existing groups - which will generally only happen with more “fuzzy” match criteria.


Hi @aish_TF, just want to check whether @Phil Holbrook's reply answered your question. Please let us know if you have any follow-up questions 🙋🏻‍♀️


Hi ​@Phil Holbrook ​@Cansu 

Apologies for replying so late.


Thank you for your reply. I have found the reason behind my issue with Oliver Kerul-kemec's help.
Unfortunately, I cannot upload the configurations, but I will try to replicate the scenario or use an example to explain it in private to Cansu in a few days and if it feels worth it, we can publish on community, because I feel it is an interesting scenario.

Thank you for the answers, they did get me to understand a lot about rematch and reprocess jobs.

Regards,
Aishwarya


That would be a great asset for the community ​@aish_TF thank you for being willing to share 🙌

