Solved

How to check for duplicate records in ONE Desktop using fuzzy logic?

1 year ago
May 22, 2023
4 replies
424 views

suryakanth.Emisha
Data Pioneer

Hello. I have a CSV file of general material data, and I need to check it for duplicates based on the material data. I'm currently trying the Matching step, creating partitions and key rules, but I'm still not sure I have it right. Does anyone know how to solve this? Thanks!

Best answer by AKislyakov

AKislyakov
Ataccamer
1 year ago
May 25, 2023

Hi Suryakanth,

You can find a tutorial on how to configure the Matching step in the Tutorials project > 09 Match and merge > 09.01 Match and merge.plan.

If you're new to this, I suggest trying the Canopy Clustering step first. It requires less configuration and is more user-friendly. You can find an example usage in the Tutorials project > 07 Analyze > 07.10 Clustering.plan.

The Canopy Clustering step lets you set up the components used to search for duplicates, their weights, and the thresholds that determine when two records are considered to belong to the same group.
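
To make the idea of components, weights, and thresholds more concrete, here is a minimal Python sketch of weighted fuzzy grouping. It is only an illustration of the general technique, not how the Canopy Clustering step is implemented in ONE Desktop. The field names, the weights, the threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical component weights and threshold; not Ataccama configuration values.
WEIGHTS = {"name": 0.7, "address": 0.3}
THRESHOLD = 0.8


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(r1: dict, r2: dict) -> float:
    """Weighted average of the per-component similarities."""
    total = sum(WEIGHTS.values())
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items()) / total


def group_records(records: list[dict]) -> list[int]:
    """Give each record the ID of the first group whose representative it matches."""
    representatives = []  # list of (group_id, first record seen in that group)
    group_ids = []
    for rec in records:
        for gid, rep in representatives:
            if record_score(rec, rep) >= THRESHOLD:
                group_ids.append(gid)
                break
        else:
            # No existing group is similar enough, so start a new one.
            gid = len(representatives)
            representatives.append((gid, rec))
            group_ids.append(gid)
    return group_ids
```

Raising the threshold makes groups stricter (fewer false matches, more missed duplicates), while lowering it does the opposite. Shifting weight toward a component makes that column dominate the decision.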

suryakanth.Emisha
Data Pioneer
1 year ago
May 28, 2023

Hello @AKislyakov, I'll try this method and let you know the result. Thanks.

Radziah
Data Voyager
1 year ago
October 24, 2023

Hi @AKislyakov, I am following this tutorial, but I find it hard to understand each function, even though I used the Help button to see what each one does. What I am looking for is a way to detect duplicates across several columns such as Name, Address, etc. After following the guide, my results were a bit confusing: different Name values end up in the same cluster. For example, Yayasan A and Kementerian share one cluster ID, whereas I expect names like Yayasan A, Yysn, and Yayasn A to belong to the same cluster and Kementerian to belong to a different cluster ID. What should I change in the Canopy Clustering configuration to get the correct result?

Cansu
Community Manager
1 year ago
October 30, 2023

Hi @Radziah, thanks for posting! Have you had a chance to check any resources on the Jaccard Index? If you haven’t, I would highly suggest checking some, such as this, for more perspective on clustering that may help clarify the results. If you need additional support, please feel free to raise a support ticket with screenshots at support.ataccama.com and our team would be happy to help 🙋🏻‍♀️
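
For a rough feel of what the Jaccard Index measures, the sketch below computes it over character bigrams for the names mentioned in this thread. This is an assumption-laden illustration: whether and how the Canopy Clustering step tokenizes values is not stated here, so the bigram tokenization and the helper names `bigrams` and `jaccard` are hypothetical choices for the example.

```python
def bigrams(text: str) -> set[str]:
    """Return the set of 2-character substrings of the lowercased text, spaces removed."""
    s = text.lower().replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the two bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two empty sets are treated as identical
    return len(x & y) / len(x | y)


names = ["Yayasan A", "Yayasn A", "Yysn", "Kementerian"]
for other in names[1:]:
    print(f"{names[0]!r} vs {other!r}: {jaccard(names[0], other):.2f}")
```

Under this particular tokenization, "Yayasan A" and "Yayasn A" share most of their bigrams, "Kementerian" shares only one, and the abbreviation "Yysn" shares none. A very loose threshold could therefore pull unrelated names into one cluster, while heavy abbreviations may need a different comparison or a cleansing step before clustering.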
