Using Group Aggregator to check column uniqueness


(Mike Roberts) #1

Hello all, I have a need to build a rule that checks for uniqueness for a particular column and have tried a couple of methods using Record Descriptor Builder and (now) Group Aggregator. The Record Descriptor Builder rule (Testing for uniqueness for a field) works “except” it shows multiple rows for a given invalid result so that not ideal for the dashboard. **I attempted to use the “Filter” to remove the duplicates but Ataccama didn’t like inconsistent input vs output results.

Which leads me to trying Group Aggregator. My goal was to group my column and flag a record as “invalid” where the count is greater than 1 (not unique). My issue is that it doesn’t allow the required “result” shadow column (from the input) to flow down through Group Aggregator to the output. I’ve tried to use “Alter Format” to add the “result” column but receive errors when running the rule.

Here’s what my rule looks like:
rule_wf

Rule output:

Has anyone else used a similar rule building method or created a different rule to check for uniqueness?

Thanks in advance,
Mike Roberts


Unique value check within the Data Quality Inidcator
(Radek Šedina) #2

Hello Mike,

After my investigation, I would recommend you using representative creator step for your use-case.

It allows you to set up rules for instance attributes, where you can use the expression for grouping, as per:

I have used group.size=1 as a grouping expression which groups the records with the same id and shows them as false, i.e. not unique.

I am attaching an example plan of the solution.
representative_creater.plan (3.9 KB)

Let me know if you have any more questions.

Best regards,

Radek.


(Mike Roberts) #3

Thanks for responding Radek. I created a rule using that working method and it gave me essentially the same duplicated outcome as using the Record Descriptor Builder, which I’m ok with (works but not ideal). Ultimately, I would love for Ataccama to allow me to filter out the duplicates but by design seems to get mad when # of inputs don’t match # of outputs.

**I wasn’t aware of how the Representative Creator worked so that’s for the example. :slight_smile:

Have a great day,
Mike


(Amey Gupte) #4

Hello all,
This is very helpful, thank you. Is there a way you can merge the non-unique rows to eventually output a file / table with all unique rows?


(Maksim Zhelyazkov) #5

For anyone wondering, you can use the Filter step after the Representative Creator step to create an output table with only unique or duplicate records.

Best regards,
Maksim