Solved

Handling Large Datasets in Data Profiling & Data Quality



Ataccama One Web:

  1. How do you handle a large dataset (around 1 billion rows) during data profiling and data quality evaluation? What are the best practices?
  2. What infrastructure configuration is required for such huge datasets?

Best answer by ekaterina.ponomareva

Hi @akshayl09!

Thank you for your question.

You would indeed need to consider several architectural factors when working with large data sets.

  • Scalability: DPE (Data Processing Engine) can be scaled horizontally to handle increasing data volumes.
  • Resource Allocation: Make sure memory is scaled appropriately.

Monitor running tasks to identify performance issues or failures, then review and adjust the configurations. There is no single perfect configuration; you need to tune it for your process.
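
At this scale, a useful general pattern (independent of Ataccama internals) is to profile incrementally rather than load the full table into memory, so memory stays bounded no matter how many rows the source holds. A minimal sketch, assuming a DB-API source and a hypothetical table and column:

```python
# Generic chunked-profiling sketch (illustrative, not Ataccama internals).
# Assumes a DB-API 2.0 connection and a hypothetical "transactions" table.
import sqlite3  # stand-in for any DB-API 2.0 driver

CHUNK_SIZE = 1_000_000  # rows fetched per round trip

def profile_column(conn, table: str, column: str) -> dict:
    """Accumulate count/nulls/min/max in a single pass, chunk by chunk."""
    cur = conn.cursor()
    cur.execute(f"SELECT {column} FROM {table}")
    count = nulls = 0
    minimum = maximum = None
    while True:
        rows = cur.fetchmany(CHUNK_SIZE)
        if not rows:
            break
        for (value,) in rows:
            count += 1
            if value is None:
                nulls += 1
                continue
            minimum = value if minimum is None else min(minimum, value)
            maximum = value if maximum is None else max(maximum, value)
    return {"rows": count, "nulls": nulls, "min": minimum, "max": maximum}
```

The same single-pass idea extends to other profiling statistics (null rates, frequency counts) by adding accumulators to the loop.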

It also depends heavily on the data source. For some data sources it is possible to use the ‘Pushdown’ feature, which means executing data quality operations (profiling, rule evaluation) directly within the Big Data source (e.g., Snowflake, Databricks, Synapse). Ataccama ONE Gen2 translates data quality configurations into native SQL (or equivalent) and executes them on the platform; a sketch of this pattern follows the list of benefits below.

Benefits:

  • Reduced Data Transfer: Minimizes data movement, saving bandwidth and time.
  • Leverages Platform Resources: Utilizes the processing power and scalability of the Big Data platform.
  • Improved Performance: Faster execution due to optimized processing within the data source.
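
To make the pushdown idea concrete, here is a minimal sketch of the kind of single aggregate query that can run entirely inside Snowflake, using the snowflake-connector-python driver. The connection parameters, table, and column names are placeholders, and the SQL Ataccama ONE actually generates will differ:

```python
# Illustrative pushdown-style profiling: the aggregation runs inside
# Snowflake, so only a single summary row travels back to the client.
# All connection parameters and table/column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="...",            # placeholder
    warehouse="PROFILING_WH",  # placeholder
    database="SALES",
    schema="PUBLIC",
)

PROFILE_SQL = """
SELECT
    COUNT(*)               AS row_count,
    COUNT(amount)          AS non_null_count,
    COUNT(DISTINCT amount) AS distinct_count,
    MIN(amount)            AS min_value,
    MAX(amount)            AS max_value,
    AVG(amount)            AS mean_value
FROM transactions
"""

cur = conn.cursor()
cur.execute(PROFILE_SQL)
print(cur.fetchone())  # one summary row instead of 1B raw rows
```

Because only the one summary row leaves the warehouse, the billion rows themselves are never transferred, which is exactly the "Reduced Data Transfer" benefit listed above.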
     

Also, if you want to run a monitoring project only on part of the data set, you can take advantage of the data slicing feature: https://docs.ataccama.com/one/latest/catalog-items/create-data-slice.html
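
Conceptually, a data slice restricts profiling or monitoring to rows matching a filter condition; in pushdown terms this amounts to adding a WHERE clause. A hypothetical sketch, reusing the Snowflake connection from the pushdown example above (the date column and the 30-day window are assumptions for illustration only):

```python
# Hypothetical slice: profile only the most recent 30 days rather than
# all 1B rows. The WHERE condition mirrors what a data slice would
# restrict the monitoring project to. Reuses `cur` from the sketch above.
SLICED_SQL = """
SELECT COUNT(*)    AS row_count,
       MIN(amount) AS min_value,
       MAX(amount) AS max_value
FROM transactions
WHERE created_at >= DATEADD(day, -30, CURRENT_DATE)  -- slice condition
"""
cur.execute(SLICED_SQL)
print(cur.fetchone())
```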

Hope this helps! Let me know if you have any further questions.

Kind regards,
Ekaterina

2 replies



Cansu
Community Manager · May 12, 2025

Hi @akshayl09, I’m closing this thread for now. If you have any follow-up questions, please feel free to share them in the comments or create a new post 🙋🏻‍♀️



