Skip to main content

Happy Monday Community!

 

We are continuing our ONE Desktop best practices series this week as well! Today on the menu is - DQ Scoring!

Data quality scoring is a vital process in Ataccama Desktop, enabling you to assess data quality based on specific rules and assign a numerical score. This score helps differentiate between high-quality and low-quality records, making it an essential step in data processing.

Here are the best practices for data quality scoring in Ataccama Desktop:

1. Categorize Scoring into Four Levels

To effectively score data quality, categorize your cleansing scores into four quantification levels:

Scoring Level Scoring Description Scoring Result
0 No modification was done to the input value. VALID
< 10,000 Small modification was done to the input value. VALID/CORRECTED
< 10,000,000 Huge modification was done to the input value. UNSAFE/UNKNOWN
> 10,000,000 Input value or pre-cleansed value is null. NOT VALID

Each cleansing operation, as described in the "Scoring Description" column, corresponds to a specific score value and explanation code. For instance, the table contains rules, scores, and explanation codes for scoring a Social Security number within the < 10,000 scoring level.
 

2. Store Scores and Explanation Codes Separately

To maintain clarity and organization, store scores and explanation codes in separate columns for each attribute, following the naming convention: sco_oattribute] and exp_pattribute]. You can aggregate multiple explanation codes in the explanation field, separated by whitespace.

For example, your scoring results for Social Security numbers might look like this:

src_sin out_sin exp_sin sco_sin Scoring Result
573896836 573896836 SIN_INVALID_CHECK 10,000 NOT VALID
417 208 303 417208303 SIN_REMOVED_SEPARATORS SIN_VALID 100 VALID/CORRECTED
417-208303 417208303 SIN_DIFFERENT_SEPARATORS SIN_VALID 200 VALID/CORRECTED
sin 417 - 208 -303 417208303 SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID 1,200 VALID/CORRECTED
4 172   083  0      3 417208303 SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID 300 VALID/CORRECTED
aaabbbccc aaabbbccc SIN_NOT_PARSED 1,000,000 NOT VALID


3. Define Scoring Rules within Range Limits

Ensure that your scoring rules are designed so that, within a given range, the sum of scores for corresponding operations does not exceed the upper bound. This is crucial for maintaining consistency and accuracy in your scoring process.
 

4. Gradate Rules by Importance

To prioritize certain rules over others, use a factor of 10 to gradate them by importance. For example, you can assign scores of 1,000 for less critical rules and 10,000 for more critical ones, allowing you to focus on the most significant data quality issues.


Attribute Scoring vs. Record Scoring

Scoring rules are typically applied to individual attributes, but you can also aggregate scoring results at the record level, which is particularly useful for matching and merging purposes. Here's how to handle record scoring effectively:


1. Score Individual Attributes

Follow the best practices mentioned above to score individual attributes accurately.


2. Divide Attributes by Importance

Consider categorizing attributes into two groups: important and not important. This division helps you prioritize certain attributes for scoring and subsequent data processing. For instance, attributes like gender and phone number may have lower importance for matching purposes.

Attribute Value Importance
out_first_name John IMPORTANT
out_last_name Rasmusen IMPORTANT
out_gender 1 NOT IMPORTANT
out_birthdate_day 16.11.1978 IMPORTANT
out_credit_card 720412410 IMPORTANT
out_email janedoe@email.com IMPORTANT
out_phone_number 12 012 004 548 NOT IMPORTANT
out_sin 720412410 IMPORTANT


3. Use Two Record Score Columns

To aggregate record scores efficiently, use two columns:

  1. sco_instance: The sum of all attribute scores.
  2. sco_instance_imp_att: The sum of scores for important attributes.

Accurate scoring is key to making informed decisions about data quality and improving overall data management. So let us know your own tips & tricks in the comments below 👇

Be the first to reply!

Reply