Happy Monday Community!
We are continuing our ONE Desktop best practices series this week as well! Today on the menu: DQ Scoring!
Data quality scoring is a vital process in ONE Desktop: it lets you assess data quality against specific rules and assign a numerical score. This score differentiates high-quality from low-quality records, making it an essential step in data processing.
Here are the best practices for data quality scoring in ONE Desktop:
1. Categorize Scoring into Four Levels
To effectively score data quality, categorize your cleansing scores into four quantification levels:
Scoring Level | Scoring Description | Scoring Result |
---|---|---|
0 | No modification was made to the input value. | VALID |
< 10,000 | A small modification was made to the input value. | VALID/CORRECTED |
< 10,000,000 | A significant modification was made to the input value. | UNSAFE/UNKNOWN |
> 10,000,000 | The input value or pre-cleansed value is null. | NOT VALID |
Each cleansing operation, as described in the Scoring Description column, corresponds to a specific score value and explanation code. For instance, you might define a set of rules, scores, and explanation codes for scoring a Social Insurance Number (SIN) within the < 10,000 scoring level.
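If it helps to see the thresholds in executable form, here is a minimal sketch in plain Python (not Ataccama expression syntax) that maps an aggregated attribute score to the levels above; the function name is ours, and the treatment of the exact 10,000,000 boundary is an assumption:

```python
def scoring_result(score: int) -> str:
    """Map an aggregated attribute score to a quantification level.

    Thresholds follow the table above; treating scores of exactly
    10,000,000 as NOT VALID is an assumption for the boundary case.
    """
    if score == 0:
        return "VALID"              # no modification was needed
    if score < 10_000:
        return "VALID/CORRECTED"    # small modification
    if score < 10_000_000:
        return "UNSAFE/UNKNOWN"     # significant modification
    return "NOT VALID"              # null or unusable input


print(scoring_result(0))       # VALID
print(scoring_result(200))     # VALID/CORRECTED
print(scoring_result(20_000))  # UNSAFE/UNKNOWN
```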
2. Store Scores and Explanation Codes Separately
To maintain clarity and organization, store scores and explanation codes in separate columns for each attribute, following the naming convention sco_[attribute] and exp_[attribute]. You can aggregate multiple explanation codes in the explanation field, separated by whitespace (a small sketch of this convention follows the example table below).
For example, your scoring results for Social Security numbers might look like this:
src_sin | out_sin | exp_sin | sco_sin | Scoring Result |
---|---|---|---|---|
573896836 | 573896836 | SIN_INVALID_CHECK | 10,000 | NOT VALID |
417 208 303 | 417208303 | SIN_REMOVED_SEPARATORS SIN_VALID | 100 | VALID/CORRECTED |
417-208303 | 417208303 | SIN_DIFFERENT_SEPARATORS SIN_VALID | 200 | VALID/CORRECTED |
sin 417 - 208 -303 | 417208303 | SIN_COMMENT SIN_DIFFERENT_SEPARATORS SIN_VALID | 1,200 | VALID/CORRECTED |
4 172 083 0 3 | 417208303 | SIN_SPACES_AT_UNUSUAL_POSITIONS SIN_VALID | 300 | VALID/CORRECTED |
aaabbbccc | aaabbbccc | SIN_NOT_PARSED | 1,000,000 | NOT VALID |
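To make the column convention concrete, here is a deliberately simplified Python sketch (not an Ataccama component; the rule names and scores mirror the example table, but the parsing logic is illustrative only) of how a cleansing routine can accumulate sco_sin and exp_sin while normalizing a SIN:

```python
import re

def cleanse_sin(src_sin: str) -> dict:
    """Normalize a SIN and collect its score and explanation codes.

    Each applied operation adds its score to sco_sin and appends its
    explanation code to exp_sin (whitespace-separated).
    """
    sco, exp = 0, []
    value = src_sin

    # Illustrative rule: strip a leading "sin" comment.
    if value.lower().startswith("sin"):
        value = value[3:]
        sco += 1_000
        exp.append("SIN_COMMENT")

    # Illustrative rule: remove separators (spaces, dashes).
    if re.search(r"[ -]", value):
        value = re.sub(r"[ -]", "", value)
        sco += 100
        exp.append("SIN_REMOVED_SEPARATORS")

    # Illustrative rule: the cleansed value must be exactly nine digits.
    if re.fullmatch(r"\d{9}", value):
        exp.append("SIN_VALID")
    else:
        sco += 1_000_000
        exp.append("SIN_NOT_PARSED")

    return {"src_sin": src_sin, "out_sin": value,
            "exp_sin": " ".join(exp), "sco_sin": sco}


print(cleanse_sin("417 208 303"))
# {'src_sin': '417 208 303', 'out_sin': '417208303',
#  'exp_sin': 'SIN_REMOVED_SEPARATORS SIN_VALID', 'sco_sin': 100}
```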
3. Define Scoring Rules within Range Limits
Ensure that your scoring rules are designed so that, within a given range, the sum of scores for the corresponding operations does not exceed the range's upper bound. Otherwise several minor operations firing on the same value could push a record into the next scoring level, undermining the consistency and accuracy of your scoring process.
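One simple sanity check, sketched below with the small-modification rules from the SIN example (values as in the table above, treated here as hypothetical configuration): even if every rule in the level fires on the same value, the total must stay below the level's upper bound.

```python
# Hypothetical rule scores for the "< 10,000" (small-modification) level.
small_modification_rules = {
    "SIN_REMOVED_SEPARATORS": 100,
    "SIN_DIFFERENT_SEPARATORS": 200,
    "SIN_SPACES_AT_UNUSUAL_POSITIONS": 300,
    "SIN_COMMENT": 1_000,
}

UPPER_BOUND = 10_000

# Even if every small-modification rule fires on the same value, the
# aggregated score must stay inside the level's range.
assert sum(small_modification_rules.values()) < UPPER_BOUND, \
    "rule scores overflow into the next scoring level"
```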
4. Gradate Rules by Importance
To prioritize certain rules over others, use a factor of 10 to gradate them by importance. For example, you can assign scores of 1,000 for less critical rules and 10,000 for more critical ones, allowing you to focus on the most significant data quality issues.
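In the same illustrative spirit, gradation by a factor of 10 just means keeping the scores of less and more critical rules an order of magnitude apart, so a critical finding dominates the aggregated score:

```python
# Illustrative score tiers, an order of magnitude apart.
MINOR_ISSUE = 1_000      # e.g. formatting had to be cleaned up
CRITICAL_ISSUE = 10_000  # e.g. a checksum did not validate

minor_only = 3 * MINOR_ISSUE                      # 3,000: several minor findings
with_critical = 2 * MINOR_ISSUE + CRITICAL_ISSUE  # 12,000: the critical rule dominates
```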
Attribute Scoring vs. Record Scoring
Scoring rules are typically applied to individual attributes, but you can also aggregate scoring results at the record level, which is particularly useful for matching and merging purposes. Here's how to handle record scoring effectively:
1. Score Individual Attributes
Follow the best practices mentioned above to score individual attributes accurately.
2. Divide Attributes by Importance
Consider categorizing attributes into two groups: important and not important. This division helps you prioritize certain attributes for scoring and subsequent data processing. For instance, attributes like gender and phone number may have lower importance for matching purposes.
Attribute | Value | Importance |
---|---|---|
out_first_name | John | IMPORTANT |
out_last_name | Rasmusen | IMPORTANT |
out_gender | 1 | NOT IMPORTANT |
out_birthdate_day | 16.11.1978 | IMPORTANT |
out_credit_card | 720412410 | IMPORTANT |
out_email | janedoe@email.com | IMPORTANT |
out_phone_number | 12 012 004 548 | NOT IMPORTANT |
out_sin | 720412410 | IMPORTANT |
3. Use Two Record Score Columns
To aggregate record scores efficiently, use two columns:
- sco_instance: The sum of all attribute scores.
- sco_instance_imp_att: The sum of scores for important attributes.
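Purely as an illustration of the aggregation (column and variable names are assumptions, not Ataccama syntax), the two record score columns can be derived from already-scored attributes and the importance split from step 2 like this:

```python
# Attribute scores produced by the attribute-level scoring (illustrative values).
attribute_scores = {
    "out_first_name": 0,
    "out_last_name": 200,
    "out_gender": 0,
    "out_birthdate_day": 0,
    "out_credit_card": 100,
    "out_email": 0,
    "out_phone_number": 1_000,
    "out_sin": 0,
}

# Importance split from step 2.
important_attributes = {
    "out_first_name", "out_last_name", "out_birthdate_day",
    "out_credit_card", "out_email", "out_sin",
}

# sco_instance: the sum of all attribute scores.
sco_instance = sum(attribute_scores.values())

# sco_instance_imp_att: the sum of scores for important attributes only.
sco_instance_imp_att = sum(
    score for attr, score in attribute_scores.items()
    if attr in important_attributes
)

print(sco_instance, sco_instance_imp_att)  # 1300 300
```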
Accurate scoring is key to making informed decisions about data quality and improving overall data management. Let us know your own tips & tricks in the comments below!