Australian Tax Files Numbers, commonly known as TFNs, are 9-digit numbers where the last digit is a validation check-digit.
How would we go about setting up Detection and DQ Evaluation Rules for TFNs?
The following article applies to ONE version 13.x. The screenshots are from v13.9.
For the Detection Rule, we might want to ignore any extraneous characters in the data, for example, the data that is profiled might contain, TFN:123456782, that could be easily transformed into a valid TFN by removing the prefix.
For the DQ Evaluation Rule, we might want to consider values that contain the TFN: prefix like in previous example as invalid, to indicate that further processing/cleansing of the data is required.
The good news is that we can use almost the same condition for detecting and validating the numeric part of the TFN. For the DQ Evaluation Rule, we’ll use an extra condition to check that only numbers are present.
Detection Rule
To configure the Detection Rule in ONE, head to the Rules section under Data Quality and click on Create.
Give the rule a name, Tax File Number, an optional description, and click Save.
On the Overview tab, optionally add the Glossary Term, “Tax File Number”. If we haven’t defined the term previously, we can add it here and continue with the rule definition.
Now, let’s switch to the Implementation tab.
Add a single Input Attribute, which for the sake of simplicity we’ll call, value, of type String.
In the Rule Logic section, select Detection from the rule type drop-down.
To the right of Condition, click the Condition Builder drop-down and change to Advanced Expression. Click Change when prompted to confirm changing the configuration method.
Add the following expression to the Condition configuration box:
digits_only := trashNonDigits( value );
num_digits := length( digits_only );
digit_1 := toInteger( substr( digits_only, 0, 1 ) );
digit_2 := toInteger( substr( digits_only, 1, 1 ) );
digit_3 := toInteger( substr( digits_only, 2, 1 ) );
digit_4 := toInteger( substr( digits_only, 3, 1 ) );
digit_5 := toInteger( substr( digits_only, 4, 1 ) );
digit_6 := toInteger( substr( digits_only, 5, 1 ) );
digit_7 := toInteger( substr( digits_only, 6, 1 ) );
digit_8 := toInteger( substr( digits_only, 7, 1 ) );
digit_9 := toInteger( substr( digits_only, 8, 1 ) );
prod_1 := digit_1;
prod_2 := digit_2 * 4;
prod_3 := digit_3 * 3;
prod_4 := digit_4 * 7;
prod_5 := digit_5 * 5;
prod_6 := digit_6 * 8;
prod_7 := digit_7 * 6;
prod_8 := digit_8 * 9;
prod_9 := digit_9 * 10;
sum := prod_1 + prod_2 + prod_3 + prod_4 + prod_5 + prod_6 + prod_7 + prod_8 + prod_9;
num_digits == 9
and sum%11 == 0
The rule should look something like this:
Note, that we haven't needed to define any Variables since we’re using the Advanced Expression builder.
We can now test the rule logic by clicking on Test Rule.
Once we’re happy, let’s click on Publish. If the Workflow bar isn’t currently visible, scroll up to the top of the page.
Now let’s assign the Detection Rule to our Business Term.
Navigate to the Term in the Business Glossary, go to the Settings tab and click on Add Rules. Find our Tax File Number rule, tick the checkbox next to it and configure the threshold as appropriate for your data.
Publish the changes to the Term.
From now on, whenever we profile a data source that contains (more than the threshold of) TFNs, the Business Term, Tax File Number, should be automatically assigned to the corresponding attribute.
Detection Rule done! Now we’ll configure the DQ Evaluation Rule. Much of the process for configuring the DQ Evaluation Rule is the same as it was for the Detection Rule.
DQ Evaluation Rule
To configure the DQ Evaluation Rule, go to the Rules section under Data Quality and click on Create.
Give the rule a name, validation Tax File Number, an optional description, and click on Save.
On the Overview tab, add the Glossary Term, Tax File Number, so that our rule will be suggested when setting up monitoring on sources that contain Tax File Number tagged attributes.
Switch to the Implementation tab, and add a single Input Attribute, value, of type String.
In the Rule Logic section, select Validity from the rule type drop-down.
Feel free to keep the IS_EMPTY condition if we want to check for missing values.
Either change the IS_EMPTY condition or click Add condition.
Select does not match regexp and enter \d{9} in the regular expression field. This regex will check that value contains exactly nine digits. Add an appropriate Explanation, e.g., TFN_NOT_NINE_DIGITS, and a Score, as desired.
Add another Advanced Expression condition using the same logic from our Detection Rule, with one minor change. Invert the logic, so instead of checking for valid TFNs we’re checking for invalid ones. That is, replace:
num_digits == 9
and sum%11 == 0
with :
num_digits <> 9
or sum%11 <> 0
Now the condition will check if the input does not contain a valid TFN. Update the Explanation to something meaningful, e.g. TFN_INVALID_CHECK_DIGIT and a Score.
Test the Rule by scrolling up and clicking on the Test Rule button.
See if you can work out to add some flexibility to the rule, for example, allowing embedded hyphens and/or spaces.
Publish the rule.
Finally, find a data source containing TFNs, profile the source and then set up an appropriate Monitoring Project to check the quality!