Hi guys, I have a few use cases where I need to run both row-level DQ evaluations (completeness, format checks, etc.) and table-level DQ evaluations (aggregations on specific fields, such as sum of sales or count of products) on the same data set. When I applied both types of checks to one catalog item, the profiling plus DQ checks took a long time to run. Our data volume is a few hundred million records (~450 million). At present I have created two separate SQL catalog items, one for row-level checks and one for table-level checks. I would like to know whether this can be achieved with a single dataset without running into performance issues. Please suggest. Version is 15.4.1.
Can we perform row-level and table-level DQ checks on a single data object at the same time?
Best answer by OGordon100
Hi there.
The performance issues you face are likely related to transferring your dataset to be processed.
To explain: if you do not use one of our pushdown processing engines, the data has to be sent somewhere to be processed. If you have a hybrid DPE or are self-hosted, the entire dataset is sent to a virtual machine in your company’s cloud; otherwise it is sent to the Ataccama Cloud. Your administrator can confirm which one applies to you.
It is likely that all 450m records have to be sent to be processed, which takes time. Unless you shrink the dataset itself, stick to sample profiling, or similar, this is unavoidable. It is an inevitable consequence of working with data at scale in the enterprise.
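To get a rough feel for the transfer cost, here is a back-of-the-envelope sketch in plain Python. The row size (200 bytes) and link speed (1 Gbit/s) are assumptions made up for illustration, not measurements from your environment, and the estimate ignores compression, protocol overhead, and processing time.

```python
def transfer_seconds(rows, bytes_per_row, megabits_per_second):
    """Estimate the time to move a dataset over a link, ignoring overhead."""
    total_bits = rows * bytes_per_row * 8
    return total_bits / (megabits_per_second * 1_000_000)

rows = 450_000_000   # record count from the question
bytes_per_row = 200  # assumed average row size, replace with your own
link_mbps = 1000     # assumed sustained bandwidth (1 Gbit/s)

seconds = transfer_seconds(rows, bytes_per_row, link_mbps)
gigabytes = rows * bytes_per_row / 1e9
print(f"{gigabytes:.0f} GB, about {seconds / 60:.0f} minutes")
# prints "90 GB, about 12 minutes"
```

Under the same assumptions, a 10% sample would need roughly a tenth of that time, which is why sampling or a smaller dataset helps so much.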
I would suggest you work with your architecture team to increase compute and bandwidth, presuming you truly need all 450m records.
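On the question of combining both kinds of check: conceptually, row-level and table-level checks can share one read of the data. The sketch below is generic Python, not Ataccama's rule engine or API, and the field names (`product`, `email`, `sales`) and the email pattern are hypothetical. It only illustrates that the row checks and the aggregates can be computed in the same pass, so the data is moved and scanned once.

```python
import re

# Hypothetical, deliberately loose format rule for illustration only.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def evaluate(rows):
    """Run row-level checks and accumulate table-level aggregates in one pass."""
    failed_rows = []   # indexes of rows failing any row-level check
    sum_sales = 0.0    # table-level: sum of sales
    products = set()   # table-level: distinct products

    for index, row in enumerate(rows):
        # Row-level checks: completeness and format.
        complete = row.get("product") not in (None, "")
        valid_email = bool(EMAIL_RE.match(row.get("email") or ""))
        if not (complete and valid_email):
            failed_rows.append(index)

        # Table-level aggregates, updated from the same row.
        sum_sales += row.get("sales") or 0
        if complete:
            products.add(row["product"])

    return {
        "failed_rows": failed_rows,
        "sum_sales": sum_sales,
        "product_count": len(products),
    }

sample = [
    {"product": "A", "email": "a@x.com", "sales": 10.0},
    {"product": "", "email": "b@x.com", "sales": 5.5},
    {"product": "B", "email": "not-an-email", "sales": 2.0},
]
result = evaluate(sample)
```

Whether the platform evaluates your rules this way on a single catalog item is something to confirm with your administrator or support.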
Best of luck!