
I need to create a complex duplicate-check rule. I am familiar with the aggregation rule; however, in this case it won't work for me. Here is the logic for determining duplicates:

If the name, unit of measure, location, and type are all the SAME but the code (not a PK) is DIFFERENT, then the rows are considered duplicates.


These would be considered duplicates:

Code  Name  UOM  Location  Type
123   Trek  LBS  Houston   Bike
234   Trek  LBS  Houston   Bike

These rows are not considered duplicates, because we can extend a product to different locations instead of creating a new row:

Code  Name   UOM  Location     Type
678   Razor  LBS  Houston      Scooter
678   Razor  LBS  San Antonio  Scooter

When I write the rule, I use the aggregation rule to check that the combination of name, UOM, location, and type is not unique; however, that can give me false positives for the scenario above (Code = 678). I don't see a way to say these fields are the same while this field is different. I am doing this in the web client; however, I'm open to writing something in the desktop client if there is a way to do this. Any help would be appreciated.
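
In SQL terms (just to make the logic concrete; your_table is a stand-in for the real table), what I'm after is something like this self-join sketch:

SELECT *
FROM your_table a
WHERE EXISTS (
    SELECT 1
    FROM your_table b
    WHERE b.Name = a.Name
      AND b.UOM = a.UOM
      AND b.Location = a.Location
      AND b.Type = a.Type
      AND b.Code <> a.Code  -- same four fields, but a different code
);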

I have been thinking through this, and there could be an easier way to catch these records than creating a complex data quality rule. You could create a custom SQL catalog item that only brings in records where duplicates exist, and then apply a rule that checks the record count. That way, whenever such records appear, the rule will immediately fail and you will be able to view those records.

Example query you could use in your catalog item:

SELECT t.*
FROM your_table t
JOIN (
    -- Groups that share Name, UOM, Location, and Type but contain
    -- more than one distinct Code are duplicates under the rule above;
    -- COUNT(DISTINCT Code) enforces the "code is different" condition.
    SELECT Name, UOM, Location, Type
    FROM your_table
    GROUP BY Name, UOM, Location, Type
    HAVING COUNT(DISTINCT Code) > 1
) dup
  ON t.Name = dup.Name
 AND t.UOM = dup.UOM
 AND t.Location = dup.Location
 AND t.Type = dup.Type;
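
To sanity-check the logic, here is the same pattern run against the sample rows from your post. The products CTE is just stand-in data, and the VALUES-list syntax here is PostgreSQL-style, so adjust it for your database:

WITH products (Code, Name, UOM, Location, Type) AS (
    VALUES
        (123, 'Trek',  'LBS', 'Houston',     'Bike'),
        (234, 'Trek',  'LBS', 'Houston',     'Bike'),
        (678, 'Razor', 'LBS', 'Houston',     'Scooter'),
        (678, 'Razor', 'LBS', 'San Antonio', 'Scooter')
)
SELECT t.*
FROM products t
JOIN (
    SELECT Name, UOM, Location, Type
    FROM products
    GROUP BY Name, UOM, Location, Type
    HAVING COUNT(DISTINCT Code) > 1
) dup
  ON t.Name = dup.Name
 AND t.UOM = dup.UOM
 AND t.Location = dup.Location
 AND t.Type = dup.Type;
-- Returns only the two Trek rows (codes 123 and 234). The Razor rows
-- fall into separate groups because Location differs, so they are not flagged.

A record-count rule on the catalog item (fail whenever it returns more than zero rows) will then surface these duplicates as soon as they appear.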