AI-enabled term detection is giving significant false positives. How can I tune it?

1 year ago
May 21, 2024
4 replies
47 views

Prasad Rani
Data Pioneer

For terms like USA Zipcode and USA Social Security Number, the term has AI detection enabled and it is getting tagged to hundreds of fields, most of which are NOT zip codes or SSNs.

If it were giving a score and a recommendation, I would have discarded the recommendation, but it is assigning the term automatically and I have to go in and remove it, which has become painful. I have now disabled the AI detection, but is there a better way to tune it so that it works better?

PetrD
Ataccamer
1 year ago
May 24, 2024

Hi,
there are two mechanisms which are often confused with each other:
1) Rule-based term detection - term instances are detected and automatically assigned by data and metadata rules. This functionality is enabled by assigning Detection Rules in the Settings for a given term.
2) Term Suggestions - term instances are detected by an AI algorithm and then suggested (never assigned directly), and have to be confirmed or rejected by a user. This functionality can be enabled or disabled with the "AI detection" option in the Settings for each term, right above the detection rules.

I suppose you are talking about 2) Term Suggestions, because you wrote that it can be influenced by disabling "AI detection".
It is possible that the Term Suggestions algorithm provides a lot of false positive suggestions in the beginning, because the algorithm starts untrained for each term (it does not understand the term's meaning or where it should be suggested) and learns as the platform is used. It learns from examples of attributes to which users have assigned the given term, and from users accepting or rejecting its suggestions on different attributes. It can get negative examples (attributes to which the term should not be assigned) only from humans rejecting false positive suggestions, so it is necessary to do that. Rejecting a few suggestions should be enough to see a significant share of the false positive suggestions (for the given term) disappear. However, note that the algorithm updates the suggestions asynchronously, so it might take minutes or even hours to learn when the catalog contains a lot of attributes (e.g. millions).
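To make the feedback loop described above more concrete, here is a toy sketch in Python. It is not Ataccama's algorithm, whose internals are not described in this thread. The functions `tokens`, `similarity` and `confidence`, the Jaccard similarity on attribute names, and the scoring formula are all assumptions chosen only to illustrate why rejecting a few false positives can lower the confidence of similar suggestions.

```python
# Toy illustration only: NOT the actual Term Suggestions algorithm.
# All names and the scoring formula are hypothetical.

def tokens(name):
    """Split an attribute name like 'cust_zip_code' into lowercase tokens."""
    return set(name.lower().replace("-", "_").split("_"))

def similarity(a, b):
    """Jaccard similarity between the token sets of two attribute names."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def confidence(candidate, accepted, rejected):
    """Closeness to accepted examples minus closeness to rejected ones."""
    pos = max((similarity(candidate, a) for a in accepted), default=0.0)
    neg = max((similarity(candidate, r) for r in rejected), default=0.0)
    return max(0.0, pos - neg)

# One positive example: the term is assigned to 'zip_code'.
accepted = ["zip_code"]

# Before any rejection, 'order_code' looks somewhat similar to 'zip_code'.
before = confidence("order_code", accepted, rejected=[])

# After the user rejects the suggestion on 'product_code',
# the similar false positive 'order_code' loses its confidence.
after = confidence("order_code", accepted, rejected=["product_code"])

# A genuinely similar attribute keeps a positive score.
still_good = confidence("postal_zip_code", accepted, rejected=["product_code"])
```

In this sketch, `before` is positive, `after` drops to zero, and `still_good` stays positive. That mirrors the behaviour described above: without rejections the scorer has no negative examples to learn from.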

PetrD
Ataccamer
1 year ago
May 24, 2024

Prasad Rani wrote:

For terms like USA Zipcode and USA Social Security Number, the term has AI detection enabled and it is getting tagged to hundreds of fields, most of which are NOT zip codes or SSNs.

If it were giving a score and a recommendation, I would have discarded the recommendation, but it is assigning the term automatically and I have to go in and remove it, which has become painful. I have now disabled the AI detection, but is there a better way to tune it so that it works better?

I am confused by this, because the AI detection (Term Suggestions) should never assign a term instance automatically. It should only suggest, and in that case there should be a confidence score associated with the suggestion. On the other hand, if you are talking about term instance assignment based on rules, the "AI detection" checkbox should not have any influence on that functionality. Which version of the platform are you using?

Prasad Rani
Data Pioneer
1 year ago
May 29, 2024

I am using 14.5.1 and I am talking about Term Detection. With no rule assigned for detection and just the AI Enabled flag ON, it is tagging several hundred of my fields with such terms (like zipcode, first name, last name), and most of them are false positives. When I disabled the AI detection with no rule assigned, it did nothing, meaning no field was mapped to the term. Then I kept the AI disabled and assigned two detection rules, one value-based and one metadata-based (on column name), and now it is working as expected.

I was only expressing my concern about AI-enabled term assignment making several incorrect assignments. I am also not sure whether it learns from removing the term assignments. Because there were several hundred false assignments, I removed the term and recreated it with detection rules.
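For readers who want to see what a value-based rule and a column-name rule might look like in principle, here is a minimal Python sketch. It is not Ataccama detection-rule syntax. The regular expressions, the 0.9 threshold, the function names, and the choice to require both checks at once are assumptions made for illustration only.

```python
import re

# Hypothetical patterns, for illustration only (not Ataccama rule syntax).
ZIP_VALUE_RE = re.compile(r"\d{5}(-\d{4})?")      # 12345 or 12345-6789
SSN_VALUE_RE = re.compile(r"\d{3}-\d{2}-\d{4}")   # 123-45-6789
ZIP_NAME_RE = re.compile(r"zip|postal", re.IGNORECASE)

def value_match_ratio(values, pattern):
    """Fraction of non-empty values that fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)

def detect_zip(column_name, values, threshold=0.9):
    """Tag a column as a zip code only if the name AND the values agree."""
    name_ok = bool(ZIP_NAME_RE.search(column_name))
    values_ok = value_match_ratio(values, ZIP_VALUE_RE) >= threshold
    return name_ok and values_ok

# A column of five-digit order numbers matches the value pattern,
# but the column name check keeps it from being tagged.
is_zip_1 = detect_zip("zip_code", ["12345", "90210-1234", "10001"])
is_zip_2 = detect_zip("order_id", ["12345", "67890"])
is_zip_3 = detect_zip("zip", ["abc", "12345"])
```

Requiring both checks is one way to cut false positives such as five-digit IDs. Whether the two rules should be combined this way or applied independently depends on how the detection rules are set up for the term.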

PetrD
Ataccamer
1 year ago
May 30, 2024

Thanks for the clarification. Then it is as I wrote: the AI Term Detection algorithm should learn quite quickly, so if you start rejecting the suggestions, the confidence of the remaining ones should go down to the point where they start disappearing.
The AI algorithm can also be confused if some term instance is assigned incorrectly, because it then takes that assignment as truth and suggests the same term on similar attributes. But even in this case it could be fixed by rejecting several wrong suggestions (although the better solution is to remove the wrongly assigned term instance).
