Question

Duplicate Record in the table based on all the columns



Is there a method to create a single rule for identifying duplicate records based on all columns in a table? This rule should be applicable to all tables, regardless of the number of columns they contain.

Currently we need to add each column to an aggregated rule. This becomes very difficult when a table has more than 10 or 15 columns.

 


Lisa Kovalskaia
Ataccamer

Hi ​@akshayl09, the only option I can think of is to create an SQL CI or a VCI which would present a single column with the concatenated values of the source columns. The uniqueness rule would be very simple then, but of course there would be some additional effort at the data preparation stage. Whether or not this approach is reasonable also depends on where you do data prep, e.g. in the source system or in Ataccama. If you already use SQL CIs/VCIs to preshape your source data for DQ, then you might as well add the composite key column, but if you generally apply rules to source tables then I'd rather keep the complexity in the aggregation rules. 
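A minimal sketch of the concatenated-key approach described above, using SQLite purely for illustration (the table and column names are hypothetical; in Ataccama this projection would live in an SQL CI or VCI):

```python
import sqlite3

# Hypothetical source table with a multi-column key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, line_no INTEGER, sku TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [("A1", 1, "X"), ("A1", 1, "X"), ("A1", 2, "Y")])

# SQL CI/VCI-style projection: one composite_key column that concatenates
# all source columns, with a delimiter to avoid accidental collisions
# and COALESCE so NULLs don't null out the whole key.
rows = conn.execute("""
    SELECT order_id, line_no, sku,
           COALESCE(order_id,'') || '|' || COALESCE(line_no,'') || '|' || COALESCE(sku,'')
             AS composite_key
    FROM orders
""").fetchall()

# A uniqueness rule then only needs to look at composite_key.
keys = [r[3] for r in rows]
print(keys)  # ['A1|1|X', 'A1|1|X', 'A1|2|Y']
```

The uniqueness rule itself stays trivial; the cost, as noted, moves to the data preparation stage where the composite key is built.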

I'm curious, what is the use case beyond identifying these duplicates? It sounds like this is a very common issue in the data; why does it occur, and how do you work with it after getting DQ results? Thanks for sharing!


  • Universe Traveller
  • March 18, 2025

Yes. The use case is to find duplicates on fact tables, which mostly have composite keys made up of many columns. Some tables have 15-20 columns. In SQL this can be done quickly by comparing `SELECT COUNT(*)` against a count of distinct rows. Creating an SQL CI is an option, but one would have to be created for every table, which duplicates the tables.
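The count comparison mentioned above can be sketched as follows (SQLite for illustration; note that `COUNT(DISTINCT *)` is not accepted by most databases, so the distinct count is taken over a subquery instead):

```python
import sqlite3

# Hypothetical fact table with one fully duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (k1 TEXT, k2 INTEGER, amount REAL)")
conn.executemany("INSERT INTO fact VALUES (?, ?, ?)",
                 [("A", 1, 10.0), ("A", 1, 10.0), ("B", 2, 5.0)])

total = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
# COUNT(DISTINCT *) is invalid SQL; count distinct rows via a subquery.
distinct = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM fact)"
).fetchone()[0]

duplicates = total - distinct
print(total, distinct, duplicates)  # 3 2 1
```

Any positive difference between the two counts indicates fully duplicated rows.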

Although we tried to create a common rule based on the composite key of a table, many tables have different combinations (data types) of composite key columns, so we need to create a separate rule for each table. It's just a thought about reusability: a rule that does not depend on data types could be reused for any combination of them. Each table also has a different number of columns forming the composite key, sometimes 2 or 3 and sometimes 10, so a duplicate check needs to be created for each table. If there were a table-level rule, it would help avoid this duplication of rules, as the same rule could be repurposed for all tables.
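Outside of Ataccama's rule engine, the table-level, datatype-independent check described above can be sketched generically by reading the column list from database metadata, so one function covers any table regardless of how many columns form the key (SQLite for illustration; the table name is hypothetical):

```python
import sqlite3

def duplicate_count(conn, table):
    """Count surplus rows in `table`, treating all columns as the key."""
    # Read column names from metadata so the check adapts to any table,
    # independent of column count or data types.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    col_list = ", ".join(cols)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {col_list} FROM {table})"
    ).fetchone()[0]
    return total - distinct

# Hypothetical table with mixed data types and one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a TEXT, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [("x", 1), ("x", 1), ("y", 2)])
print(duplicate_count(conn, "t"))  # 1
```

A subset of columns could be passed in instead of reading all of them, which would cover the 2-, 3-, or 10-column composite-key cases with the same function.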


Lisa Kovalskaia
Ataccamer

@akshayl09 it sounds like creating SQL CIs for just this one purpose may not be the right approach for you. In that case you may need to continue with data-specific, non-reusable uniqueness rules; sometimes this is a necessary tradeoff to support certain validation requirements.

From what you described, in some tables composite keys are a combination of all columns, while in other tables only a subset of columns makes up the composite key, and on top of that there's variation in data types. All in all, I think adding a generic product feature to make this sort of uniqueness rule reusable would be a significant change from how rules work at the moment. Nevertheless, if you have a moment, please share your idea on our dedicated Community space for product feedback: https://community.ataccama.com/ideas. This is where our product team draws insights on customer needs, and your suggestion may gain input and upvotes from other users.

Thank you!


  • Universe Traveller
  • March 19, 2025


