Data Quality and Integration Scenarios

  • 12 January 2023


Introduction

This post was created to answer the following questions:

  • When moving data through data pipelines, how can the Ataccama ONE DQ platform support data integration platforms other than its own?
  • How do other integration platforms access the DQ rules?
  • How do other integration platforms flag records for Data Stewards to review?

DQ platform in context of integration processes

There are three key patterns typically used to integrate an ETL process with Data Quality:

Option 1

The integration platform calls Ataccama DQ rules exposed as a real-time API, sending batches or micro-batches of data for evaluation.

[Diagram: the integration platform calls an Ataccama DQ rule exposed as a real-time API]
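To make the pattern concrete, here is a minimal Python sketch of such a call from the integration platform's side. The endpoint URL, payload shape, and response fields are hypothetical placeholders, not the actual Ataccama API:

```python
import requests

# Hypothetical endpoint -- the real URL, authentication, and record format
# depend on how the DQ rule is published in your environment.
DQ_ENDPOINT = "https://ataccama.example.com/dq/rules/validate-customer"

def evaluate_batch(records):
    """Send a micro-batch of records to a DQ rule exposed as a real-time API."""
    response = requests.post(DQ_ENDPOINT, json={"records": records}, timeout=30)
    response.raise_for_status()
    return response.json()  # assumed to contain per-record DQ scores and flags

batch = [
    {"id": 1, "email": "jane.doe@example.com"},
    {"id": 2, "email": "not-an-email"},
]
results = evaluate_batch(batch)
```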

Option 2

The integration platform loads data into a staging/landing layer and triggers (via an API call) a batch process in Ataccama. Ataccama evaluates and improves the data and populates the cleansed layer in the data lake, including DQ scores and flags.

[Diagram: batch DQ processing between the staging/landing layer and the cleansed layer of the data lake]
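As a rough sketch of this trigger-and-poll interaction, again with hypothetical endpoint paths and response fields rather than the real Ataccama API:

```python
import time
import requests

# Hypothetical job-trigger API -- the real paths, credentials, and job
# identifiers depend on the deployment.
BASE_URL = "https://ataccama.example.com/api"

def run_batch_dq(job_name):
    """Trigger a batch DQ job once the staging layer is loaded, then poll it."""
    started = requests.post(f"{BASE_URL}/jobs/{job_name}/run", timeout=30)
    started.raise_for_status()
    job_id = started.json()["jobId"]  # assumed response field
    while True:
        status = requests.get(f"{BASE_URL}/jobs/{job_id}/status", timeout=30).json()
        if status["state"] in ("FINISHED", "FAILED"):
            return status
        time.sleep(30)  # poll every 30 seconds until the job completes
```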
 

Option 3

Ataccama can be connected to a data stream, such as Kafka, and evaluate the data in the stream on the fly: every message is checked, scored, and returned to an outgoing topic, or even routed into separate valid and invalid outgoing topics.

[Diagram: stream DQ processing with incoming, valid, and invalid Kafka topics]
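The sketch below shows the shape of that routing using the kafka-python client. The topic names are made up, and the dq_score function stands in for the evaluation that Ataccama's stream processing would actually perform:

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

# Hypothetical topic names and bootstrap server.
consumer = KafkaConsumer(
    "incoming-records",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def dq_score(record):
    """Stand-in for the DQ evaluation: attach a score/flag to each message."""
    record["dq_valid"] = "@" in record.get("email", "")
    return record

for message in consumer:
    scored = dq_score(message.value)
    # Route every checked and scored message to a valid or invalid outgoing topic.
    topic = "valid-records" if scored["dq_valid"] else "invalid-records"
    producer.send(topic, value=scored)
```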
 

Data pass, data discard and flow stop

As for how the DQ process fits into the overall execution of integration flows: Data Quality is typically checked at two stages of any ETL process, at Extraction and after Transformation. Four different types of measures are typically taken when checking data that is being extracted and transformed (see the sketch after the list):

  • Pass the record without error (= ignore),

  • Pass the record and mark the affected data/column (= flag),

  • Discard/sort out the record (= detect & extract),

  • Stop the integration flow (= raise critical error).
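
A minimal Python sketch of these four measures as a dispatch on a per-record DQ result. The severity values and result fields are an illustrative convention, not an Ataccama output format:

```python
class CriticalDataError(Exception):
    """Raised to stop the integration flow."""

def apply_measure(record, dq_result, passed, flagged, discarded):
    severity = dq_result["severity"]
    if severity == "ignore":        # pass the record without error
        passed.append(record)
    elif severity == "flag":        # pass the record, mark the affected data
        record["dq_flags"] = dq_result["flags"]
        passed.append(record)
        flagged.append(record)
    elif severity == "extract":     # discard / sort out the record
        discarded.append(record)
    elif severity == "critical":    # stop the integration flow
        raise CriticalDataError(f"Critical DQ error on record {record.get('id')}")
```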

Typically, what we'd want to do is flag any invalid data (in terms of completeness, correctness, consistency, or timeliness) and let the integration process run. Results of the DQ evaluation and an aggregated summary of any flagged problems are typically visualized in a DQ dashboard or report, either within the Ataccama platform or in third-party reporting tools such as Power BI.

Only particular exceptions, based on defined criteria, are typically turned into "issues" (= workflows) that require Data Steward review or approval. These extracted issues do not typically interrupt the integration process; they are raised and resolved afterwards. Any update or resolution from the Data Steward can be applied to the transformed data ex post.

Critical errors that result in the integration flow being stopped are typically based on incorrect data volumes/structures (which is something the integration platform can detect on its own) or on the relative volume of flagged/invalid data values being too high. It is typically the integration platform that calculates the aggregated validity of the processed dataset from the DQ outputs of Ataccama processing; a simple version of that aggregate check is sketched below.
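
Here the 5% threshold and the dq_valid field are illustrative assumptions, not a prescribed convention:

```python
INVALID_RATIO_THRESHOLD = 0.05  # assumed tolerance for invalid records

def check_dataset_validity(dq_results):
    """Aggregate per-record DQ outputs and stop the flow if too many are invalid."""
    if not dq_results:
        return 0.0
    invalid = sum(1 for r in dq_results if not r["dq_valid"])
    ratio = invalid / len(dq_results)
    if ratio > INVALID_RATIO_THRESHOLD:
        # Raising here is the "stop the integration flow" measure.
        raise RuntimeError(f"Stopping flow: {ratio:.1%} of records failed DQ checks")
    return ratio
```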

The same approach applies to cloud-based integration platforms as well as traditional ETL tools. With Azure, it would typically be done via REST API calls to Ataccama DQ rules from ADF or Logic Apps. Cloud Data Fusion would likewise either trigger a job in Ataccama via a REST API call or trigger a batch job on top of the created files and relational or non-relational databases.

Examples

Example 1: Ataccama ONE DQ & Azure Synapse

Below is an example of an Azure-based solution where not just Data Quality but also the Ataccama Catalog and Master Data Management are integrated with the Azure-based process:

[Diagram: Ataccama ONE DQ, Catalog, and MDM integrated with an Azure Synapse-based process]

Example 2: Ataccama ONE DQ & Microsoft SSIS

[Diagrams: Ataccama ONE DQ integrated with a Microsoft SSIS package]

Thanks to @tombolek for this content!

 

