Solved

Approximate string matching - fuzzy matching / lookup - How do I create a fuzzy match?

2 years ago
December 16, 2022
3 replies
266 views

Marnix Wisselaar
Star Blazer L2
46 replies

Hi,

We'd like to compare two string-fields with several words. Is there a fuzzy matching or approximate string matching step for a plan?

In this case it is not reference of lookup data. They are name fields. One is in the DB the other in a file.

Kind regards, Marnix

Best answer by AKislyakov

Hi Marnix, there are two ways how you can approach the task:

If one of the sources is kind-of reference source and you know it is better than the other. You can create a lookup file from the better one (with Approximative Index checked) and then use it to find similar strings in the other source. There are two examples in the Tutorial project you can use as a guidance. “03.02 LKPs for enrichment.plan” -- how to build an Approximative lookup and “04.03 Approximative lookup.plan” -- how to use it.
If your task is something like matching customers. Then you can use the Matching step and build Matching keys on n-grams and then use one of the fuzzy matching functions to narrow the results. If this is the case, I’d suggest as a first step to use the Canopy Clustering step. It is pre-configured to use mentioned algorithm and can be used to get some estimates on what result could look like. There is again an example in “07.10 Clustering.plan”.

View original

Did this topic help you find an answer to your question?

AKislyakov
Ataccamer
165 replies
Answer
2 years ago
December 17, 2022

Hi Marnix, there are two ways how you can approach the task:

If one of the sources is kind-of reference source and you know it is better than the other. You can create a lookup file from the better one (with Approximative Index checked) and then use it to find similar strings in the other source. There are two examples in the Tutorial project you can use as a guidance. “03.02 LKPs for enrichment.plan” -- how to build an Approximative lookup and “04.03 Approximative lookup.plan” -- how to use it.
If your task is something like matching customers. Then you can use the Matching step and build Matching keys on n-grams and then use one of the fuzzy matching functions to narrow the results. If this is the case, I’d suggest as a first step to use the Canopy Clustering step. It is pre-configured to use mentioned algorithm and can be used to get some estimates on what result could look like. There is again an example in “07.10 Clustering.plan”.

Marnix Wisselaar
Author
Star Blazer L2
46 replies
2 years ago
December 19, 2022

Hi,

I enjoyed a little experiment with this and found that actually combining the two gives really good results, because both the Levenshtein and Canopy have their strong points.

So here's the logic:

Create a Lookup first
Match based on the Levenshtein with the lookup
The rest of the file try to match based on the Canopy
Union and ready to go

Enclosed the plan, created in v13.7 It is a bit messy, but gives the idea.

You can see the two different approaches in the ComparingLevenshteinAndCanopy.plan. They are integrated in the FuzzyMatchingSample.

Of course an additional step would be to compare the to be matched values to itself and make sure to get rid of possible duplicates first.

1 Attachments

Fuzzy Matching.zip

Marnix Wisselaar
Author
Star Blazer L2
46 replies
2 years ago
December 20, 2022

Compared it to FuzzyWuzzy and the results are quite good. Maybe I like them even better:

import pandas as pd
from fuzzywuzzy import fuzz

# Read in the CSV files
inpt = pd.read_csv('FuzzyMatching Input.csv')
mtch = pd.read_csv('FuzzyMatching Match.csv')

print(inpt)
print(mtch)

# Get a list of all the fields to compare
colunmIn = inpt['Input'].tolist()
colunmMatch = mtch['Match'].tolist()

# Initialize a list to store the duplicates
duplicates = []

# Loop through the list of match and check for duplicates against input
for i in range(len(colunmIn)):
for j in range(i+1, len(colunmMatch)):
if fuzz.token_set_ratio(colunmIn[i], colunmMatch[j]) > 75:
matched = str(colunmIn[i]), str(colunmMatch[j]), 'token_set_ratio'
duplicates.append(matched)

# Print the list of duplicates
dups = pd.DataFrame(duplicates)
print(dups)
dups.to_csv('results.csv', index=False)

Reply

Cookie policy

We use cookies to enhance and personalize your experience. If you accept you agree to our full cookie policy. Learn more about our cookies.

Cookie settings

We use 3 different kinds of cookies. You can choose which cookies you want to accept. We need basic cookies to make this site work, therefore these are the minimum you can select. Learn more about our cookies.

Basic
Functional

Normal
Functional + analytics

Complete
Functional + analytics + social media + embedded videos + marketing

1 Attachments

Reply

Related topics

FME Desktop won't load data from this WFS service.icon

FME desktop won't load data from wfs service.icon

fme 2019 server fault tolerant setup for netscalericon

403 Forbidden Error When Making FME Server Request FME server 2018.icon

WMS not displaying in FME Desktop, loads fine in ArcGIS Pro.icon

Most Liked this week

Sign up

Login to the Ataccama Community

Scanning file for viruses.

This file cannot be downloaded

Cookie policy

Cookie settings