Hi @akshayl09 !
Thank you for your question.
You would indeed need to consider several things in the architecture when working with large data sets.
- Scalability: DPE (Data Processing Engine) can be scaled horizontally to handle increasing data volumes.
- Resource Allocation: Make sure memory is sized appropriately for the data volumes you process.
- Monitoring: Monitor running tasks to identify performance issues or failures, then review and adjust the configurations accordingly. There is no single perfect configuration; you will need to tune it for your process.
It will also depend heavily on the data source; for example, the ‘Pushdown’ feature is available for some data sources. Pushdown means executing data quality operations (profiling, rule evaluation) directly within the Big Data source (e.g., Snowflake, Databricks, Synapse): Ataccama ONE Gen2 translates the data quality configuration into native SQL (or equivalent) and executes it on the platform (there is a rough illustration of the idea after the list of benefits below).
Benefits:
- Reduced Data Transfer: Minimizes data movement, saving bandwidth and time.
- Leverages Platform Resources: Utilizes the processing power and scalability of the Big Data platform.
- Improved Performance: Faster execution due to optimized processing within the data source.
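To make the pushdown idea more concrete, here is a minimal, hypothetical sketch of what “translating a rule into native SQL” can look like. The table, column, rule, and generated SQL below are illustrative assumptions only, not the actual SQL that Ataccama ONE generates:

```python
# Conceptual sketch only: a simple "completeness + format" rule pushed down as
# a single aggregate query, so that only summary counts leave the warehouse.
# Table, column, and function choices are hypothetical examples.

def build_pushdown_sql(table: str, column: str, regex: str) -> str:
    """Return an aggregate query that evaluates the rule inside the source."""
    return f"""
        SELECT
            COUNT(*)                                           AS total_rows,
            COUNT_IF({column} IS NULL)                         AS null_rows,
            COUNT_IF({column} IS NOT NULL
                     AND NOT REGEXP_LIKE({column}, '{regex}')) AS invalid_rows
        FROM {table}
    """

# Example: validate e-mail format on a (hypothetical) CUSTOMERS table.
print(build_pushdown_sql("CUSTOMERS", "EMAIL", "^[^@]+@[^@]+[.][^@]+$"))
```

Only the three summary numbers come back to Ataccama, instead of every row being transferred and evaluated outside the platform.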
Also, if you want to run a Monitoring Project only on part of the data set, you can take advantage of the data slicing feature: https://docs.ataccama.com/one/latest/catalog-items/create-data-slice.html
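Conceptually, a data slice acts like a row filter applied before profiling or monitoring runs, so only the relevant subset of data is processed. A tiny illustrative sketch (the table, column, and condition are made-up examples, and this is not Ataccama’s slice syntax):

```python
# Conceptual sketch only: a data slice narrows the rows a monitoring project
# evaluates, similar to wrapping the source in a WHERE clause. The names
# below are hypothetical examples, not Ataccama configuration.

def sliced_source(table: str, slice_condition: str) -> str:
    """Return a query that restricts monitoring to the sliced subset of rows."""
    return f"SELECT * FROM {table} WHERE {slice_condition}"

# Example: monitor only the last 30 days of orders instead of the full history.
print(sliced_source("ORDERS", "ORDER_DATE >= DATEADD(day, -30, CURRENT_DATE())"))
```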
Hope this helps! Let me know if you have any further questions.
Kind regards,
Ekaterina