Skip to main content

Dear members,

I am struggling with a step that rather reflects query/array logic than the logic of steps available in component editor.

Task is to find the maximum from a range of values. I tried to retrieve it through a dynamic query written in a Representative Creator step, see details below:

L1:= round((1-(levenshtein(W_CUA1,W_CUA1_MATCH)/(max(length(W_CUA1),length(W_CUA1_MATCH)))))*100,0);
L2:= round((1-(levenshtein(W_CUA1,W_CUA2_MATCH)/(max(length(W_CUA1),length(W_CUA2_MATCH)))))*100,0);
L3:= round((1-(levenshtein(W_CUA1,W_CUA3_MATCH)/(max(length(W_CUA1),length(W_CUA3_MATCH)))))*100,0);
L4:= round((1-(levenshtein(W_CUA1,W_CUA4_MATCH)/(max(length(W_CUA1),length(W_CUA4_MATCH)))))*100,0);
L5:= round((1-(levenshtein(W_CUA2,W_CUA1_MATCH)/(max(length(W_CUA2),length(W_CUA1_MATCH)))))*100,0);
L6:= round((1-(levenshtein(W_CUA2,W_CUA2_MATCH)/(max(length(W_CUA2),length(W_CUA2_MATCH)))))*100,0);
L7:= round((1-(levenshtein(W_CUA2,W_CUA3_MATCH)/(max(length(W_CUA2),length(W_CUA3_MATCH)))))*100,0);
L8:= round((1-(levenshtein(W_CUA2,W_CUA4_MATCH)/(max(length(W_CUA2),length(W_CUA4_MATCH)))))*100,0);
L9:= round((1-(levenshtein(W_CUA3,W_CUA1_MATCH)/(max(length(W_CUA3),length(W_CUA1_MATCH)))))*100,0);
L10:= round((1-(levenshtein(W_CUA3,W_CUA2_MATCH)/(max(length(W_CUA3),length(W_CUA2_MATCH)))))*100,0);
L11:= round((1-(levenshtein(W_CUA3,W_CUA3_MATCH)/(max(length(W_CUA3),length(W_CUA3_MATCH)))))*100,0);
L12:= round((1-(levenshtein(W_CUA3,W_CUA4_MATCH)/(max(length(W_CUA3),length(W_CUA4_MATCH)))))*100,0);
L13:= round((1-(levenshtein(W_CUA4,W_CUA1_MATCH)/(max(length(W_CUA4),length(W_CUA1_MATCH)))))*100,0);
L14:= round((1-(levenshtein(W_CUA4,W_CUA2_MATCH)/(max(length(W_CUA4),length(W_CUA2_MATCH)))))*100,0);
L15:= round((1-(levenshtein(W_CUA4,W_CUA3_MATCH)/(max(length(W_CUA4),length(W_CUA3_MATCH)))))*100,0);
L16:= round((1-(levenshtein(W_CUA4,W_CUA4_MATCH)/(max(length(W_CUA4),length(W_CUA4_MATCH)))))*100,0);

max(L1,L2,L3,L4,L5,L6,L7,L8,L9,L10,L11,L12) // and here comes the trouble: Ataccama has no formula to pick maximum from an N number range but from 2 sources at a time.

Independently from above logic, the case can be simplified as below:

L1:= levenshtein(W_CUA1,W_CUA1_MATCH)   //everything else is just custom calculation, we can restrict the case to basics

L2:= levenshtein(W_CUA1,W_CUA2_MATCH)

L3:= levenshtein(W_CUA1,W_CUA3_MATCH)

etc.. to L16.

It is already annoying that there is no loop function in code writer function, but having no chance to set a simple rank or choose a maximum NOT from a certain column but across different variables is something I cannot believe.

Could anyone help me out with a solution?

What I want finally is:

  1. to get the maximum match
  2. to get the name of two fields that provide the highest match, such as ‘W_CUA3,W_CUA1_MATCH’.

Thank you in advance!

Hi ​@Lazov.

Thanks for the question!

With Ataccama plans/components it’s important to understand that the majority of steps are multi-threaded. In this case there is no guarantee of the order that records will traverse through the plan/component.

There are some steps that force the flow to wait until records are available. This includes the Representative Creator step that you mentioned.

I’m not 100% clear on the requirement here but am hoping that the attached example plan might help (see community_dq_1627.zip).

Feel free to explain the requirement further or ask any follow up questions if this isn’t what you were trying to achieve.

Kind regards,
Adrian

 


Hi Adrian,

thank you indeed for your kind answer!
You are on the right track, despite my explanation bleeding from many wounds. :-)
It is a little more complex, as I am playing hide-and-seek with a value that can be stored in four fields: W_CUA1, W_CUA2, W_CUA3, W_CUA4.
So each W_CUA* must be matched to the other W_CUA* in each pair of records.

The whole picture is as follows:
1. I join a dataset back  to itself, based on GROUPING_ID.
GROUPING_ID covers an n number of records, belonging together by identical value in a certain field - it is indifferent what field it is, my point is that I have already done the grouping successfully.
So now, in the same row, I have the in_left.W_CUA1... to in_left.W_CUA4 and in_right.W_CUA1... to in_right.W_CUA16. For simplification's sake I apply name in_right.W_CUA1 -> W_CUA1_MATCH etc. to W_CUA4_MATCH.

2. I remove the pairs where in_left.ID = in_right.ID where ID is the record ID within the group.

3. In a Representative Creator, I remove the combinations where the pairs are the same i.e. left ID appears on right join and vice versa, eg. 4 vs 5, and 5 vs 4.

4. In the Representative Creator, I set a dynamic query to check the similarity between the all 16 combinations - that is the code in my first post.
Then the desired result would be to find the highest match, indicating which fields are attached - data comes from W_CUA1 vs. W_CUA3_MATCH etc.

 

First screenshot demonstrates the component flow I built, while second one is the case of the field matching through a life-like example.

And I still have no clue how to present in what form I want to see the results i.e. how to mark which fields bear the highest rank combination. Another field should store this information eg. ‘W_CUA1|W_CUA3_MATCH’.

And yes, your question is relevant how to rank the results when we have eg. 3 pairs with the same level match - I don’t know yet, but learning the database, there is zero chance of such case, due to the nature of the data inspected here.

Is this clear I wrote or I just made it a bigger mess? I hope I did not.

 

Thank you, Laszlo


Hi Adrian, there’s a mistake I made here, but the page does not save the text again when I edit and send it.

Instead of this:

‘So now, in the same row, I have the in_left.W_CUA1... to in_left.W_CUA4 and in_right.W_CUA1... to in_right.W_CUA16

I meant this:

‘So now, in the same row, I have the in_left.W_CUA1... to in_left.W_CUA4 and in_right.W_CUA1... to in_right.W_CUA4

 

Sorry for confusion,

Laszlo


Hi Laszlo.

Apologies for not responding sooner. There are a couple of issues with the Community site at the moment and I’m not able to see the images shared previously.

Could you perhaps try sharing them again and/or attaching a copy of any shareable plans/components?

Thanks,
Adrian


Hi ​@Adrian Anderson,

no need to apologize, I am grateful that you are dealing with this challenge I have.

I rather embed the same here in the text box - sharing the whole thing would not meet my agreement with my customer on data handling. Building it in a demonstrative test component would take more time than I have at the moment.

But certainly I will do so if your feedback was that you could not see a thing again. :-)

So now, let’s try again.

First screenshot demonstrates the component flow I built, while second one is the case of the field matching through a life-like example.

 

 

 

Please let me know whether you can see the images this time.

Thank you, Laszlo


Reply