Solved

Effects of removal of already loaded attribute data from master match

9 months ago
September 16, 2024
10 replies
80 views

aish_TF
Star Blazer L1

I would like to know what the effects will be in the MDM web application if I make the following changes.
Here is the scenario:

Let’s say I am using the email attribute in the match logic of my contact (silver) entity and have already loaded the email records into my MDM hub.
But now I do not want to use the email information from one of my three source systems when determining the contact (silver) master. I think the option is to not load email information from that system.
Will the email information from the records previously loaded from system 1 still show up after I run a reprocess job (after reloading the records from system 1 without any email information)?

Best answer by Phil Holbrook

If you exclude the email address from one source system, either by not supplying it in your source data or by excluding it from the column mapping in your load plan, then the system will regard that as a data change in the change detection phase and will overwrite the email address in your instance layer with a blank value.

So the email address from that source system will not be retained if you perform a full load of the data from that source system.

Note that a reprocess alone will not reload the data - you need to run an actual load of every row for which you want the email address excluded.

When you run a reprocess (with rematch - important!) then the match rules will be applied and the email address will not be present to be used in the match.

Depending on your match logic, you may want to consider setting a no key condition for the email address match component key for that specific source system (eng_source_system = 'source_bad_emails') which would allow you to retain the email address but exclude it from the match logic. You can then decide whether to keep that email address as a fall-back in the merge plan if there is no better address available
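
As a rough illustration of the change-detection behaviour described above, here is a minimal Python sketch. It is not Ataccama code: the function apply_full_load and the sample column values are hypothetical, and it only models the idea that a column left out of a full load is treated as a changed (blank) value and overwrites what was stored.

```python
def apply_full_load(instance_row, incoming_row):
    """Model of change detection: compare stored src_ values with the incoming row.

    A column that the incoming row does not supply is read as None (blank),
    so it counts as a change and overwrites the stored value.
    """
    changed = {}
    for column, stored_value in instance_row.items():
        new_value = incoming_row.get(column)  # None when the column is not supplied
        if stored_value != new_value:
            changed[column] = new_value
    updated = {**instance_row, **changed}
    return updated, changed


# Hypothetical instance row that was loaded earlier, including the email.
stored = {"src_name": "Ann Lee", "src_email": "ann@example.com"}
# The same source row reloaded with the email excluded from the load.
incoming = {"src_name": "Ann Lee"}

updated, changed = apply_full_load(stored, incoming)
print(updated)
print(changed)
```

A reprocess on its own never calls anything like apply_full_load in this model, which is why the stored email stays in place until the rows are actually reloaded.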


Phil Holbrook
Ataccamer
9 months ago
September 17, 2024

aish_TF
Star Blazer L1
9 months ago
September 19, 2024

Thanks @Phil Holbrook
I have a follow-up question. Currently (before making the suggested changes), I can see the email values from all the source systems under the Contact section when I open a master party record detail page. However, I do not want the email value from the said system to be visible there. Will the No Key Condition (eng_source_system = 'source_bad_emails') change in the contact_match component alone achieve this, or do I also have to stop loading email information from that system?

Phil Holbrook
Ataccamer
9 months ago
September 19, 2024

To understand this, it helps to think about what the different MDM layers and transformations are actually doing.

The load plan determines what will be stored into the src_ columns of the instance layer - so you can stop your suspect email addresses from ever entering the system by not loading them: populate src_email with null in that specific load plan. The down-side of this is you're losing the ability to review what data the source system holds at all.

The cleanse plan determines what data goes into the std_ or cio_ columns, the cleaned (and possibly enriched) version of your source data. You could set cio_email to null when populating from your src_email column during this transformation with a statement like,
iif(eng_source_system == 'source_bad_email', '', src_email)
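
For readers who find a procedural version easier to follow, the same cleanse rule can be sketched in Python. This is only a stand-in for the expression above, not Ataccama code: cleanse_email is a hypothetical helper, 'source_bad_email' is the placeholder system name used in this thread, and the sample rows are made up.

```python
BAD_SOURCE = "source_bad_email"  # placeholder system name from the thread


def cleanse_email(row):
    """Return the value for cio_email: blank for the suspect system, else src_email."""
    if row["eng_source_system"] == BAD_SOURCE:
        return ""
    return row["src_email"]


# Hypothetical instance rows; src_email is kept as loaded in both cases.
rows = [
    {"eng_source_system": "source_bad_email", "src_email": "x@bad.example"},
    {"eng_source_system": "crm", "src_email": "ann@example.com"},
]

cleansed = [{**row, "cio_email": cleanse_email(row)} for row in rows]
for row in cleansed:
    print(row)
```

The point of doing this in the cleanse step rather than the load is visible in the output: src_email still holds what the source sent, so it can be reviewed, while cio_email is blank for the suspect system.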

The match plan has no effect whatsoever on what data is stored. Its only purpose is to find instance records that should contribute to the same master record.

If you make the no-key condition change then the suspect email row will not be matched and will be assigned its own distinct master_id. If you do nothing else then it will create a separate master row.

The merge plan dictates what you store into the master layer - your data survivorship. You should always be using cleansed data for this - so std_ or cio_ columns, not src_ - but you can control which instance rows are used to supply each column for the master record.

So when creating a single master row from several matched instance rows you can include a rule in your merge plan representative creator to ignore the data from any particular source system.

The other thing I think you might be trying to achieve here is to create multiple contact rows linked to each party row, but then exclude the email value when creating the contact from that particular source system.

If you are trying to block instance rows from creating a contact master row at all for that source system then it is acceptable to filter rows out of the merge plan: unlike match and cleanse plans there's no requirement for all the rows entering a merge plan to contribute to the master layer output. So you could add a filter step before the representative creator with the condition
eng_source_system <> 'source_bad_email'
to only pass rows from other systems when creating contacts.

That only helps if there's no other information you want from the instance record of course. If you do want the record, but just to exclude the email then you can use the representative creator to set that particular column to null for that particular eng_source_system when assigning the value for the master layer column, cmo_email.
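
The second option, keeping the record but leaving its email out of the master, can be pictured with a small Python sketch. It is an illustration only: merge_contact is a hypothetical function, cio_phone and cmo_phone are made-up column names, and "first non-empty value wins" is an assumed survivorship rule, not what the representative creator necessarily does in your project.

```python
BAD_SOURCE = "source_bad_email"  # placeholder system name from the thread


def merge_contact(instance_rows):
    """Build one master row from matched instance rows.

    cmo_email ignores rows from the suspect system; cmo_phone accepts any row.
    """
    emails = [
        row["cio_email"]
        for row in instance_rows
        if row["eng_source_system"] != BAD_SOURCE and row["cio_email"]
    ]
    phones = [row["cio_phone"] for row in instance_rows if row["cio_phone"]]
    return {
        "cmo_email": emails[0] if emails else None,
        "cmo_phone": phones[0] if phones else None,
    }


# Two hypothetical instance rows that share a master_id.
matched_rows = [
    {"eng_source_system": "source_bad_email", "cio_email": "x@bad.example", "cio_phone": "555-0100"},
    {"eng_source_system": "crm", "cio_email": None, "cio_phone": None},
]

master = merge_contact(matched_rows)
print(master)
```

Here the phone number from the suspect system still reaches the master row, while its email does not, even though no other system offers an email.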

aish_TF
Star Blazer L1
9 months ago
September 20, 2024

Thank you @Phil Holbrook for explaining the concepts so well.
I do not want to consider email records coming in from the said source system, but I do want to consider its phone numbers when determining a contact master (silver) record. The model I am using is the same as the out-of-the-box MDM Example Project, so contact_clean gives me two columns, cio_contact_type and cio_contact_value, which hold all the phone and email values: cio_contact_type = 'PHONE' when cio_contact_value is a phone number, and cio_contact_type = 'EMAIL' when it is an email address. So I am considering setting the ‘No Key Condition’ to

eng_source_system ='bad_email_source' and cio_contact_type='EMAIL'

in the contact_match component. This way, I think the phone number from that system will be considered for the match but not the email. The only thing I am not sure about is whether a master_id will still be created for the email values loaded from that system. I don’t think it should, but please let me know if I am wrong.

Thank you again for taking the effort to explain this so well. Appreciated!

Phil Holbrook
Ataccamer
9 months ago
September 20, 2024

OK @aish_TF - yes: that looks like it should work.

What should happen (if I’ve understood your use case correctly!) is that each row of email addresses from ‘bad_email_source’ will be assigned a unique master id, not shared with any other row (i.e. not ‘matched’).

As I mentioned, if you do nothing else then these rows will also create master rows. Filter them out in the merge plan using the same logic you are suggesting for your no key condition - ie, add a filter step before the representative creator with the expression

NOT( eng_source_system ='bad_email_source' and cio_contact_type='EMAIL')

so that only the rows you want to keep are selected to pass through.

It’s fine to have master_id values on instance layer rows that do not match to id values in the master layer, and the master layer rows are only created when populated by the merge plan.
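
To make the interplay of the two changes concrete, here is a small Python sketch. It is a simplified model, not Ataccama code: the functions are hypothetical, the sample rows are made up, and matching is reduced to an exact comparison of contact type and value, which is an assumption for illustration only.

```python
from itertools import count


def no_key(row):
    """The no-key condition from the thread, written as a Python predicate."""
    return row["eng_source_system"] == "bad_email_source" and row["cio_contact_type"] == "EMAIL"


def assign_master_ids(rows):
    """Give no-key rows their own master_id; group the rest by (type, value)."""
    next_id = count(1)
    ids_by_key = {}
    result = []
    for row in rows:
        if no_key(row):
            master_id = next(next_id)  # never shared with another row
        else:
            key = (row["cio_contact_type"], row["cio_contact_value"])
            if key not in ids_by_key:
                ids_by_key[key] = next(next_id)
            master_id = ids_by_key[key]
        result.append({**row, "master_id": master_id})
    return result


def merge_input(rows):
    """Filter step before the representative creator: drop the no-key rows."""
    return [row for row in rows if not no_key(row)]


rows = [
    {"eng_source_system": "crm", "cio_contact_type": "EMAIL", "cio_contact_value": "a@x.com"},
    {"eng_source_system": "bad_email_source", "cio_contact_type": "EMAIL", "cio_contact_value": "a@x.com"},
    {"eng_source_system": "bad_email_source", "cio_contact_type": "PHONE", "cio_contact_value": "555-0100"},
    {"eng_source_system": "crm", "cio_contact_type": "PHONE", "cio_contact_value": "555-0100"},
]

instance = assign_master_ids(rows)
instance_ids = [row["master_id"] for row in instance]
master_ids = sorted({row["master_id"] for row in merge_input(instance)})
print(instance_ids)  # master_id per instance row
print(master_ids)    # ids that actually reach the master layer
```

In this run the bad-source email gets its own master_id on the instance layer, but that id never appears in the master layer because the row is filtered out before the merge. The two phone rows still share one id.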

aish_TF
Star Blazer L1
9 months ago
September 24, 2024

Hi @Phil Holbrook
I made the following changes:
- added a No-Key Condition eng_source_system = 'bad_email_source' and cio_contact_type='EMAIL' in the contact_match component, and
- added a Filter step before the Representative creator step in contact_merge with the condition NOT(eng_source_system = 'bad_email_source' and cio_contact_type='EMAIL').
After the changes, I restarted the server and ran a rematch_reprocess job for all entities.

The ‘bad_email_source’ email values are still being chosen as the master contact representative, i.e. cmo_contact_value for the contact master is being picked from the bad email source.

Could you please help me understand why this would be happening?

Phil Holbrook
Ataccamer
9 months ago
September 24, 2024

Hi @aish_TF

Double check that you have substituted the appropriate source system name for “bad_email_source” in your conditions - this is case sensitive and will need to be the same as the value assigned to source_system in the load plan’s map_internal_columns_* step.

It sounds like these rows are still being matched: you should be able to use the MDM WebApp to select all instance rows with master_id values that match the id of your master record and compare the values.

aish_TF
Star Blazer L1
9 months ago
September 25, 2024

Hi @Phil Holbrook
I can compare the instance records based on master_id, but I do not have the source_system information there. And in my scenario, two systems have the same EMAIL value.
I also think it could be the first reason.
But I have another observation: in the current state of my MDM hub, there are many master contact records that have only one candidate email, and that email is from the bad_email_source. I see these email values showing up as the master contact. Ideally this should not have happened with the changes made.

Phil Holbrook
Ataccamer
9 months ago
September 25, 2024

Hi @aish_TF

The source system can be a little confusing - and I'm not sure I helped much in the last post: sorry!

I've checked this through on a version 14.5.0 install, but it hasn't changed since before 13.x.

The instance layer column eng_source_system is labelled [System] in the WebApp, and is populated with the name of the Connected System as it appears in the Model Explorer. The source_system column in the load plan is unused.
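
Since a wrong or differently cased system name makes both conditions silently match nothing, a quick sanity check on the literal can save a debugging round. The Python sketch below is only an illustration: check_system_literal is a hypothetical helper, and the list of Connected System names is made up; in practice you would read the names from the Model Explorer or the [System] column.

```python
def check_system_literal(literal, connected_systems):
    """Compare the literal used in a condition with the Connected System names.

    The comparison in the condition is case sensitive, so a near-miss that
    differs only by case is reported separately.
    """
    if literal in connected_systems:
        return "ok"
    near = [name for name in connected_systems if name.lower() == literal.lower()]
    if near:
        return f"case mismatch: did you mean {near[0]!r}?"
    return "no such system"


# Hypothetical Connected System names as they might appear in the Model Explorer.
connected_systems = ["CRM", "Bad_Email_Source"]

print(check_system_literal("CRM", connected_systems))
print(check_system_literal("bad_email_source", connected_systems))
print(check_system_literal("legacy", connected_systems))
```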

aish_TF
Star Blazer L1
9 months ago
September 26, 2024

@Phil Holbrook
I am happy to report that the issue was indeed with the name of the system, as I was filtering based on the source_system value in the load plan. I changed the conditions to use the name shown in Connected System and it works perfectly fine.

Thanks a lot for working patiently with me throughout this issue.

Regards,
Aishwarya
