Hi folks,

We have a requirement to run DQ evaluation on a table with billions of records (more than 10 billion). We have to perform various checks on this table, including aggregations for uniqueness, completeness, data format, etc. Making a subset of the data doesn't help reduce the volume to an executable range either, as each subset still counts in the billions.

What would be the ideal way to do DQ on such tables? Your inputs are highly appreciated. Thanks!

@anna.spakova @Lisa Kovalskaia any suggestions here?

Hi @mp_ataccamauser,

Thank you for your question. If you need the whole dataset to perform the DQ checks, I don't think there is much that we can do.

You don't mention the data source technology here, but in case you don't need the invalid-records export, you might want to consider pushdown processing.

Another option I can think of is to prepare at least some of the more performance-demanding operations in your source system instead of in Ataccama; the aggregations in particular will always be slower in Ataccama, I think. For example, if you are checking uniqueness, you can do the computation in the source system and create an additional column that stores, e.g., the size of the group for each record. Ataccama would then just evaluate this column with a simple DQ check (> 1 → INVALID), as sketched below.
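A minimal sketch of that idea, assuming a SQL source; the table and column names (customer, business_key) are placeholders:

```sql
-- Precompute the per-record group size in the source system so that
-- Ataccama only needs a trivial rule on GROUP_SIZE (> 1 -> INVALID).
CREATE OR REPLACE TABLE customer_dq AS
SELECT
    t.*,
    COUNT(*) OVER (PARTITION BY business_key) AS group_size
FROM customer AS t;
```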

Another question could be: is partitioning possible? Again, it depends on the technology and the use case. You mention that you don't see a way to make subsets, so for the aggregations it is probably not an option. But those could be done in the source, and then in Ataccama you could split the data into smaller chunks, or even try the whole dataset with just the simple checks. The question could also be what the data actually is: billions of records that need to be unique is quite interesting.
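If chunking does turn out to be viable, one hedged sketch is to derive a stable bucket from the key and let each run process one slice. This assumes, for example, a Snowflake source; the bucket count and the key name are also assumptions:

```sql
-- Split the table into 10 stable buckets and run the simple checks
-- once per bucket (0..9). HASH() here is Snowflake's 64-bit hash.
SELECT *
FROM customer_dq
WHERE MOD(ABS(HASH(business_key)), 10) = 0;  -- bucket 0 of 10
```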

Btw in general, Ataccama should still handle it, but make sure that the DPE has sufficient tmp storage and disk space.

Perhaps Lisa will suggest some more options.

Kind regards,

Anna


Thanks for the quick response @anna.spakova. We are using a Snowflake database in this scenario. Also, uniqueness is mainly checked for a combination of multiple fields (6 fields, to be precise) being unique under a certain condition. Do you think it would help if we took only the DQ fields in the source DB and created an intermediate table/view that is pulled in as a catalog item for DQ in Ataccama?
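For reference, the view we have in mind would be shaped roughly like the sketch below; all object names, the six key fields, and the filter condition are placeholders:

```sql
-- Keep only the fields needed for DQ, and precompute the group size
-- for the 6-field uniqueness check. The WHERE clause stands in for
-- the condition under which uniqueness must hold.
CREATE OR REPLACE VIEW orders_dq_v AS
SELECT
    f1, f2, f3, f4, f5, f6,
    other_dq_field,
    COUNT(*) OVER (PARTITION BY f1, f2, f3, f4, f5, f6) AS group_size
FROM orders
WHERE status = 'ACTIVE';
```

Ataccama would then catalog orders_dq_v and evaluate a simple group_size > 1 rule.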