Happy Monday Community!
We are continuing our ONE Desktop best practices series this week as well! Today on the menu: DQ Scoring!
Data quality scoring is a vital process in ONE Desktop: it lets you assess data quality against specific rules and assign a numerical score. This score differentiates high-quality from low-quality records, making it an essential step in data processing.
Here are the best practices for data quality scoring in ONE Desktop:
1. Categorize Scoring into Four Levels
To effectively score data quality, categorize your cleansing scores into four quantification levels:
Scoring Level | Scoring Description | Scoring Result |
---|---|---|
0 | No modification was made to the input value. | VALID |
< 10,000 | A small modification was made to the input value. | VALID/CORRECTED |
< 10,000,000 | A significant modification was made to the input value. | UNSAFE/UNKNOWN |
> 10,000,000 | The input value or pre-cleansed value is null. | NOT VALID |
Each cleansing operation, as described in the Scoring Description column, corresponds to a specific score value and explanation code. For instance, you might define a set of rules, scores, and explanation codes for scoring a Social Insurance Number (SIN) within the < 10,000 scoring level.
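If it helps to see the thresholds in executable form, here is a minimal sketch in plain Python (not Ataccama expression syntax) that maps an aggregated attribute score to the levels above; the function name is ours, and the treatment of the exact 10,000,000 boundary is an assumption:

```python
def scoring_result(score: int) -> str:
    """Map an aggregated attribute score to a quantification level.

    Thresholds follow the table above; treating scores of exactly
    10,000,000 as NOT VALID is an assumption for the boundary case.
    """
    if score == 0:
        return "VALID"              # no modification was needed
    if score < 10_000:
        return "VALID/CORRECTED"    # small modification
    if score < 10_000_000:
        return "UNSAFE/UNKNOWN"     # significant modification
    return "NOT VALID"              # null or unusable input


print(scoring_result(0))       # VALID
print(scoring_result(200))     # VALID/CORRECTED
print(scoring_result(20_000))  # UNSAFE/UNKNOWN
```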
2. Store Scores and Explanation Codes Separately
To maintain clarity and organization, store scores and explanation codes in separate columns for each attribute, following the naming convention sco_[attribute] and exp_[attribute]. You can aggregate multiple explanation codes in the explanation field, separated by whitespace (a small sketch of this convention follows the example table below).
For example, your scoring results for Social Security numbers might look like this:
src_sin | out_sin | exp_sin | sco_sin | Scoring Result |
---|---|---|---|---|
573896836 | 573896836 | SIN_INVALID_CHECK | 10,000 | NOT VALID |
417 208 303 | 417208303 | SIN_REMOVED_SEPARATORS SIN_VALID | 100 | VALID/CORRECTED |
417-208303 | 417208303 | SIN_DIFFERENT_SEPARATORS SIN_VALID | 200 | VALID/CORRECTED |
sin 417 - 208 -303 | 417208303 | SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID | 1,200 | VALID/CORRECTED |
4 172 083 0 3 | 417208303 | SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID | 300 | VALID/CORRECTED |
aaabbbccc | aaabbbccc | SIN_NOT_PARSED | 1,000,000 | NOT VALID |
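To make the column convention concrete, here is a deliberately simplified Python sketch (not an Ataccama component; the rule names and scores mirror the example table, but the parsing logic is illustrative only) of how a cleansing routine can accumulate sco_sin and exp_sin while normalizing a SIN:

```python
import re

def cleanse_sin(src_sin: str) -> dict:
    """Normalize a SIN and collect its score and explanation codes.

    Each applied operation adds its score to sco_sin and appends its
    explanation code to exp_sin (whitespace-separated).
    """
    sco, exp = 0, []
    value = src_sin

    # Illustrative rule: strip a leading "sin" comment.
    if value.lower().startswith("sin"):
        value = value[3:]
        sco += 1_000
        exp.append("SIN_COMMENT")

    # Illustrative rule: remove separators (spaces, dashes).
    if re.search(r"[ -]", value):
        value = re.sub(r"[ -]", "", value)
        sco += 100
        exp.append("SIN_REMOVED_SEPARATORS")

    # Illustrative rule: the cleansed value must be exactly nine digits.
    if re.fullmatch(r"\d{9}", value):
        exp.append("SIN_VALID")
    else:
        sco += 1_000_000
        exp.append("SIN_NOT_PARSED")

    return {"src_sin": src_sin, "out_sin": value,
            "exp_sin": " ".join(exp), "sco_sin": sco}


print(cleanse_sin("417 208 303"))
# {'src_sin': '417 208 303', 'out_sin': '417208303',
#  'exp_sin': 'SIN_REMOVED_SEPARATORS SIN_VALID', 'sco_sin': 100}
```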
3. Define Scoring Rules within Range Limits
Ensure that your scoring rules are designed so that, within a given range, the sum of scores for the corresponding operations does not exceed the range's upper bound. Otherwise several minor operations firing on the same value could push a record into the next scoring level, undermining the consistency and accuracy of your scoring process.
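One simple sanity check, sketched below with the small-modification rules from the SIN example (values as in the table above, treated here as hypothetical configuration): even if every rule in the level fires on the same value, the total must stay below the level's upper bound.

```python
# Hypothetical rule scores for the "< 10,000" (small-modification) level.
small_modification_rules = {
    "SIN_REMOVED_SEPARATORS": 100,
    "SIN_DIFFERENT_SEPARATORS": 200,
    "SIN_SPACES_AT_UNUSUAL_POSITIONS": 300,
    "SIN_COMMENT": 1_000,
}

UPPER_BOUND = 10_000

# Even if every small-modification rule fires on the same value, the
# aggregated score must stay inside the level's range.
assert sum(small_modification_rules.values()) < UPPER_BOUND, \
    "rule scores overflow into the next scoring level"
```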
4. Gradate Rules by Importance
To prioritize certain rules over others, use a factor of 10 to gradate them by importance. For example, you can assign scores of 1,000 for less critical rules and 10,000 for more critical ones, allowing you to focus on the most significant data quality issues.
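In the same illustrative spirit, gradation by a factor of 10 just means keeping the scores of less and more critical rules an order of magnitude apart, so a critical finding dominates the aggregated score:

```python
# Illustrative score tiers, an order of magnitude apart.
MINOR_ISSUE = 1_000      # e.g. formatting had to be cleaned up
CRITICAL_ISSUE = 10_000  # e.g. a checksum did not validate

minor_only = 3 * MINOR_ISSUE                      # 3,000: several minor findings
with_critical = 2 * MINOR_ISSUE + CRITICAL_ISSUE  # 12,000: the critical rule dominates
```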
Attribute Scoring vs. Record Scoring
Scoring rules are typically applied to individual attributes, but you can also aggregate scoring results at the record level, which is particularly useful for matching and merging purposes. Here's how to handle record scoring effectively:
1. Score Individual Attributes
Follow the best practices mentioned above to score individual attributes accurately.
2. Divide Attributes by Importance
Consider categorizing attributes into two groups: important and not important. This division helps you prioritize certain attributes for scoring and subsequent data processing. For instance, attributes like gender and phone number may have lower importance for matching purposes.
Attribute | Value | Importance |
---|---|---|
out_first_name | John | IMPORTANT |
out_last_name | Rasmusen | IMPORTANT |
out_gender | 1 | NOT IMPORTANT |
out_birthdate_day | 16.11.1978 | IMPORTANT |
out_credit_card | 720412410 | IMPORTANT |
out_email | janedoe@email.com | IMPORTANT |
out_phone_number | 12 012 004 548 | NOT IMPORTANT |
out_sin | 720412410 | IMPORTANT |
3. Use Two Record Score Columns
To aggregate record scores efficiently, use two columns:
- sco_instance: The sum of all attribute scores.
- sco_instance_imp_att: The sum of scores for important attributes.
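Purely as an illustration of the aggregation (column and variable names are assumptions, not Ataccama syntax), the two record score columns can be derived from already-scored attributes and the importance split from step 2 like this:

```python
# Attribute scores produced by the attribute-level scoring (illustrative values).
attribute_scores = {
    "out_first_name": 0,
    "out_last_name": 200,
    "out_gender": 0,
    "out_birthdate_day": 0,
    "out_credit_card": 100,
    "out_email": 0,
    "out_phone_number": 1_000,
    "out_sin": 0,
}

# Importance split from step 2.
important_attributes = {
    "out_first_name", "out_last_name", "out_birthdate_day",
    "out_credit_card", "out_email", "out_sin",
}

# sco_instance: the sum of all attribute scores.
sco_instance = sum(attribute_scores.values())

# sco_instance_imp_att: the sum of scores for important attributes only.
sco_instance_imp_att = sum(
    score for attr, score in attribute_scores.items()
    if attr in important_attributes
)

print(sco_instance, sco_instance_imp_att)  # 1300 300
```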
Accurate scoring is key to making informed decisions about data quality and improving overall data management. Let us know your own tips & tricks in the comments below!