Solved

Handling Large Datasets in Data Profiling & Data Quality



Ataccama One Web:

  1. How do you handle a large dataset (around 1 billion rows) during data profiling and data quality evaluation? What are the best practices?
  2. What infrastructure configuration is required for such huge datasets?

Best answer by ekaterina.ponomareva

Hi @akshayl09!

Thank you for your question.

You would indeed need to consider several architectural factors when working with large data sets.

  • Scalability: DPE (Data Processing Engine) can be scaled horizontally to handle increasing data volumes.
  • Resource Allocation: Make sure memory is scaled appropriately.

Monitor running tasks to identify performance issues or failures, then review and adjust the configurations. There is no single perfect configuration; you need to tune it for your process.
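
At this scale, a useful general pattern (independent of Ataccama internals) is to profile incrementally rather than load the full table into memory, so memory stays bounded no matter how many rows the source holds. A minimal sketch, assuming a DB-API source and a hypothetical table and column:

```python
# Generic chunked-profiling sketch (illustrative, not Ataccama internals).
# Assumes a DB-API 2.0 connection and a hypothetical "transactions" table.
import sqlite3  # stand-in for any DB-API 2.0 driver

CHUNK_SIZE = 1_000_000  # rows fetched per round trip

def profile_column(conn, table: str, column: str) -> dict:
    """Accumulate count/nulls/min/max in a single pass, chunk by chunk."""
    cur = conn.cursor()
    cur.execute(f"SELECT {column} FROM {table}")
    count = nulls = 0
    minimum = maximum = None
    while True:
        rows = cur.fetchmany(CHUNK_SIZE)
        if not rows:
            break
        for (value,) in rows:
            count += 1
            if value is None:
                nulls += 1
                continue
            minimum = value if minimum is None else min(minimum, value)
            maximum = value if maximum is None else max(maximum, value)
    return {"rows": count, "nulls": nulls, "min": minimum, "max": maximum}
```

The same single-pass idea extends to other profiling statistics (null rates, frequency counts) by adding accumulators to the loop.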

It also depends heavily on the data source. For some data sources it is possible to use the ‘Pushdown’ feature, which means executing data quality operations (profiling, rule evaluation) directly within the Big Data source (e.g., Snowflake, Databricks, Synapse). Ataccama ONE Gen2 translates data quality configurations into native SQL (or equivalent) and executes them on the platform; a sketch of this pattern follows the list of benefits below.

Benefits:

  • Reduced Data Transfer: Minimizes data movement, saving bandwidth and time.
  • Leverages Platform Resources: Utilizes the processing power and scalability of the Big Data platform.
  • Improved Performance: Faster execution due to optimized processing within the data source.
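
To make the pushdown idea concrete, here is a minimal sketch of the kind of single aggregate query that can run entirely inside Snowflake, using the snowflake-connector-python driver. The connection parameters, table, and column names are placeholders, and the SQL Ataccama ONE actually generates will differ:

```python
# Illustrative pushdown-style profiling: the aggregation runs inside
# Snowflake, so only a single summary row travels back to the client.
# All connection parameters and table/column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="...",            # placeholder
    warehouse="PROFILING_WH",  # placeholder
    database="SALES",
    schema="PUBLIC",
)

PROFILE_SQL = """
SELECT
    COUNT(*)               AS row_count,
    COUNT(amount)          AS non_null_count,
    COUNT(DISTINCT amount) AS distinct_count,
    MIN(amount)            AS min_value,
    MAX(amount)            AS max_value,
    AVG(amount)            AS mean_value
FROM transactions
"""

cur = conn.cursor()
cur.execute(PROFILE_SQL)
print(cur.fetchone())  # one summary row instead of 1B raw rows
```

Because only the one summary row leaves the warehouse, the billion rows themselves are never transferred, which is exactly the "Reduced Data Transfer" benefit listed above.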
     

Also, if you want to run a monitoring project only on part of the data set, you can take advantage of the data slicing feature: https://docs.ataccama.com/one/latest/catalog-items/create-data-slice.html
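
Conceptually, a data slice restricts profiling or monitoring to rows matching a filter condition; in pushdown terms this amounts to adding a WHERE clause. A hypothetical sketch, reusing the Snowflake connection from the pushdown example above (the date column and the 30-day window are assumptions for illustration only):

```python
# Hypothetical slice: profile only the most recent 30 days rather than
# all 1B rows. The WHERE condition mirrors what a data slice would
# restrict the monitoring project to. Reuses `cur` from the sketch above.
SLICED_SQL = """
SELECT COUNT(*)    AS row_count,
       MIN(amount) AS min_value,
       MAX(amount) AS max_value
FROM transactions
WHERE created_at >= DATEADD(day, -30, CURRENT_DATE)  -- slice condition
"""
cur.execute(SLICED_SQL)
print(cur.fetchone())
```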

Hope this helps! Let me know if you have any further questions.

Kind regards,
Ekaterina

2 replies



Cansu
Community Manager · May 12, 2025

Hi @akshayl09, I’m closing this thread for now. If you have any follow-up questions, please feel free to share them in the comments or create a new post 🙋🏻‍♀️



