Solved

Can we perform row level and table level DQ checks on single data object at the same time?


Hi guys, I have a few use cases where I need to perform both row-level DQ evaluations (e.g. completeness and format checks) and table-level DQ evaluations (aggregations on specific fields, such as sum of sales or count of products) on the same data set. When I applied both types of checks to the catalog item, running the profile + DQ checks took a very long time. Our data volume is a few hundred million records (~450 million). At present I have created two separate SQL catalog items, one for the row-level checks and one for the table-level checks. I would like to know whether this can be achieved with a single dataset without running into performance issues. Please suggest. Version is 15.4.1.

Best answer by OGordon100


2 replies

OGordon100
  • Ataccamer
  • 21 replies
  • Answer
  • August 8, 2025

Hi there.

The performance issues you face are likely related to transferring your dataset to be processed. 

To explain: if you do not use one of our pushdown processing engines, the data has to be sent somewhere to be processed. If you have a hybrid DPE or are self-hosted, the entire dataset is sent to a virtual machine in your company’s cloud; otherwise it is sent to the Ataccama Cloud. Your administrator can confirm which applies to you.

It is likely that all 450M records have to be sent for processing, which takes time. Unless you shrink the dataset itself, stick to sample profiling, or similar, this is an unavoidable consequence of working with data at scale in the enterprise.

I would suggest you work with your architecture team to increase compute and bandwidth, presuming you truly need all 450M records.
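
To make the transfer-versus-pushdown point concrete, here is a rough plain-SQL sketch (not Ataccama’s internal queries; the sales table and its columns are hypothetical). A row-level check has to evaluate every record, whereas a table-level aggregate can be collapsed to a one-row summary inside the source database, so only that summary has to travel:

-- Hypothetical 'sales' table; table and column names are assumptions.
-- Row-level check: every record must be evaluated (and, without pushdown, transferred).
SELECT sale_id
FROM sales
WHERE customer_id IS NULL;   -- completeness failures, potentially millions of rows

-- Table-level check: the database reduces all the rows to a single summary row,
-- so only that one row needs to leave the source system.
SELECT
    COUNT(*)                                               AS total_rows,
    SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)   AS null_customer_ids,
    SUM(sales_amount)                                      AS total_sales
FROM sales;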

Best of luck!


@OGordon100 Thanks for your inputs. For the table-level validations I am mostly using aggregate rules, which I believe don’t support pushdown. This requirement is high priority. For example, I need to validate that at least one record with max(updated_dt) = today is present in the dataset, and that some columns always contain non-null data. Even the smallest dataset, about 4 million records, takes hours when I apply these DQ rules together.
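
To show what I mean, both table-level conditions can be expressed as a single aggregation pass. This is only a plain-SQL sketch with a hypothetical orders table and made-up column names, not an actual Ataccama aggregate rule:

-- Hypothetical 'orders' table; one scan returns both checks as pass/fail flags.
SELECT
    CASE WHEN CAST(MAX(updated_dt) AS DATE) = CURRENT_DATE
         THEN 'PASS' ELSE 'FAIL' END        AS freshness_check,
    CASE WHEN COUNT(*) = COUNT(order_id)
          AND COUNT(*) = COUNT(customer_id)
         THEN 'PASS' ELSE 'FAIL' END        AS not_null_check,   -- COUNT(col) skips NULLs
    COUNT(*)                                AS total_rows
FROM orders;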

