In data quality (DQ) management, monitoring projects let us export DQ results and keep a close eye on data quality. However, the initial export contains attributes organized in columns, alongside additional columns holding the DQ results for these attributes. Analysing this data can be challenging because DQ rule results and explanations are presented as a single string.
Although extracting DQ results in a customized format requires a more detailed approach, the result can align better with the client's specific needs. This guide explains how to export DQ results tailored to specific requirements.
The Task at Hand:
Our objective is to ensure a separate line for each attribute in an export file, with the following columns related to DQ results:
- the name of the DQ rule;
- an explanation for the check;
- a list of terms assigned to that attribute;
- the attribute value.
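To make the target layout concrete, here is a minimal Python sketch of one export row per attribute and DQ rule. The field names (`attribute_name`, `attribute_value`, `rule_name`, `explanation`, `terms`), the sample values, and the choice of a semicolon to join terms are all assumptions made for illustration; they are not names taken from the product.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class DqExportRow:
    """One line of the custom export (all field names are hypothetical)."""
    attribute_name: str
    attribute_value: str
    rule_name: str
    explanation: str
    terms: str  # terms assigned to the attribute, joined into one cell


def rows_to_csv(rows):
    """Serialize export rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DqExportRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Hypothetical sample row, invented for this sketch.
sample = [
    DqExportRow("email", "john.doe(at)example", "Email format", "Missing @ sign", "Contact;PII"),
]
csv_text = rows_to_csv(sample)
print(csv_text)
```

Each failed rule becomes its own line, so the export can be filtered or pivoted by rule, attribute, or term without parsing strings.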
We have three main steps:
- Attribute Collection: during this process, we gather attributes and their associated DQ results.
- Term Collection: in the subsequent step, we compile our list of terms for each attribute.
- Attribute and Term Combining: at this stage, the attribute data is joined with the list of terms collected in the previous step. After merging these elements, we create the final custom export format, which includes attribute values and their corresponding DQ results in a single row. This not only splits the aggregated information but also makes it more flexible for further analysis.
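The three stages above can be sketched in plain Python as follows. This is only an illustration of the data flow, not the actual plan: the record layout, the `attribute_id` key, and the sample values are hypothetical, and the sketch keeps attributes that have no terms, which may differ from the join type used in the real plan.

```python
from collections import defaultdict

# Stage 1: attribute collection (hypothetical sample records).
attribute_results = [
    {"attribute_id": "a1", "attribute": "email", "value": "john.doe",
     "rule": "Email format", "explanation": "Missing @ sign"},
    {"attribute_id": "a2", "attribute": "age", "value": "-5",
     "rule": "Age range", "explanation": "Negative value"},
]

# Stage 2: term collection, as (attribute_id, term) pairs (also hypothetical).
term_assignments = [("a1", "Contact"), ("a1", "PII"), ("a2", "Demographics")]


def collect_terms(assignments):
    """Group terms by attribute and join them into a single string."""
    grouped = defaultdict(list)
    for attribute_id, term in assignments:
        grouped[attribute_id].append(term)
    return {attribute_id: ";".join(terms) for attribute_id, terms in grouped.items()}


def combine(results, terms_by_attribute):
    """Stage 3: attach the term list to every attribute/DQ-result row."""
    combined = []
    for result in results:
        row = dict(result)
        # Assumption: attributes without terms are kept with an empty cell.
        row["terms"] = terms_by_attribute.get(result["attribute_id"], "")
        combined.append(row)
    return combined


export_rows = combine(attribute_results, collect_terms(term_assignments))
for row in export_rows:
    print(row)
```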
Let’s describe what each of the steps is used for:
- Filter Step: Filters records with non-null "invalid_rules."
- AlterFormat Step: Adds new columns with specified names and data types.
- ColumnAssigner Step (Invalid Rules Exp): Replaces commas with semicolons in "invalid_rules_explanation."
- ColumnAssigner Step: Filters out "OTHER" or "N/A" entries in "invalid_rules_explanation."
- RegexSplitter Step: Splits valid rules into separate rows per record.
- RegexMatchingAlgorithm Step: Uses regular expressions to capture and extract rule details.
- DynamicExpressionAssigner Step: Concatenates attributes, handling NULL values.
- AlterFormat Step: Adds new columns: "business_id," "datetime," and "source."
- ONE Metadata Reader Step (dqCheck): Reads and processes data.
- Join 3: Performs an inner join based on catalog attributes.
- Group Aggregator 2: Aggregates data based on "catalogItemAttribute."
- Attribute Step: Reads attribute terms.
- Only active terms: Removes from the Attribute Step output terms that were deleted manually but are still present in the list.
- Only active attributes: Likewise removes from the Attribute Step output terms that were deleted manually but are still present in the list.
- Group Aggregator: Aggregates data based on "dqCheckId."
- Join Step: Combines attributes with DQ results and terms.
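As a rough illustration of the filtering, comma replacement, splitting, and regex extraction steps listed above, the following Python sketch turns an aggregated explanation string into one row per rule. The format of `invalid_rules_explanation` is an assumption here: entries are taken to be comma-separated and shaped like `rule name: explanation`, and the field names `attribute` and `value` are hypothetical. The real export format may differ, so the pattern would need adjusting.

```python
import re

# Assumed entry shape: "<rule name>: <explanation>" (hypothetical format).
ENTRY_PATTERN = re.compile(r"^\s*(?P<rule>[^:]+?)\s*:\s*(?P<explanation>.+?)\s*$")
IGNORED_ENTRIES = {"OTHER", "N/A"}


def explode_invalid_rules(records):
    """Return one row per rule entry found in each record's explanation."""
    rows = []
    for record in records:
        # Filter: keep only records with non-null "invalid_rules".
        if record.get("invalid_rules") is None:
            continue
        # Replace commas with semicolons, then split on semicolons.
        explanation = (record.get("invalid_rules_explanation") or "").replace(",", ";")
        for entry in re.split(r"\s*;\s*", explanation):
            # Drop empty entries and "OTHER" / "N/A" placeholders.
            if not entry or entry.strip().upper() in IGNORED_ENTRIES:
                continue
            match = ENTRY_PATTERN.match(entry)
            if match is None:
                continue  # entry does not follow the assumed shape
            rows.append({
                "attribute": record["attribute"],
                "value": record["value"],
                "rule": match.group("rule"),
                "explanation": match.group("explanation"),
            })
    return rows


# Hypothetical input records, invented for this sketch.
records = [
    {"attribute": "email", "value": "john.doe", "invalid_rules": "Email format",
     "invalid_rules_explanation": "Email format: Missing @ sign,OTHER"},
    {"attribute": "age", "value": "-5", "invalid_rules": "Age range",
     "invalid_rules_explanation": "Age range: Negative value, N/A"},
    {"attribute": "name", "value": "Ann", "invalid_rules": None,
     "invalid_rules_explanation": None},
]
rows = explode_invalid_rules(records)
for row in rows:
    print(row)
```

Note that replacing every comma also affects commas inside explanation texts, so under this assumed format an explanation containing a comma would be split into separate entries.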
Monitoring projects provide convenient access to in-depth insights on data quality by exporting results in a customized format.
The material was prepared in collaboration with