Skip to main content

Hi,

We'd like to compare two string-fields with several words. Is there a fuzzy matching or approximate string matching step for a plan?

In this case it is not reference of lookup data. They are name fields. One is in the DB the other in a file.

 

Kind regards, Marnix

Hi Marnix, there are two ways how you can approach the task:

  1. If one of the sources is kind-of reference source and you know it is better than the other. You can create a lookup file from the better one (with Approximative Index checked) and then use it to find similar strings in the other source. There are two examples in the Tutorial project you can use as a guidance. “03.02 LKPs for enrichment.plan” -- how to build an Approximative lookup and “04.03 Approximative lookup.plan” -- how to use it.
  2. If your task is something like matching customers. Then you can use the Matching step and build Matching keys on n-grams and then use one of the fuzzy matching functions to narrow the results. If this is the case, I’d suggest as a first step to use the Canopy Clustering step. It is pre-configured to use mentioned algorithm and can be used to get some estimates on what result could look like. There is again an example in “07.10 Clustering.plan”.

Hi,

I enjoyed a little experiment with this and found that actually combining the two gives really good results, because both the Levenshtein and Canopy have their strong points.

So here's the logic:

  • Create a Lookup first
  • Match based on the Levenshtein with the lookup
  • The rest of the file try to match based on the Canopy
  • Union and ready to go

Enclosed the plan, created in v13.7 It is a bit messy, but gives the idea.

You can see the two different approaches in the ComparingLevenshteinAndCanopy.plan. They are integrated in the FuzzyMatchingSample.

Of course an additional step would be to compare the to be matched values to itself and make sure to get rid of possible duplicates first.


Compared it to FuzzyWuzzy and the results are quite good. Maybe I like them even better:

import pandas as pd
from fuzzywuzzy import fuzz

# Read in the CSV files
inpt = pd.read_csv('FuzzyMatching Input.csv')
mtch = pd.read_csv('FuzzyMatching Match.csv')

print(inpt)
print(mtch)

# Get a list of all the fields to compare
colunmIn = inpt['Input'].tolist()
colunmMatch = mtch['Match'].tolist()

# Initialize a list to store the duplicates
duplicates = []


# Loop through the list of match and check for duplicates against input
for i in range(len(colunmIn)):
  for j in range(i+1, len(colunmMatch)):
    if fuzz.token_set_ratio(colunmInmi], colunmMatchtj]) > 75:
        matched =    str(colunmIn(i]), str(colunmMatchlj]), 'token_set_ratio'
        duplicates.append(matched)

# Print the list of duplicates
dups = pd.DataFrame(duplicates)
print(dups)
dups.to_csv('results.csv', index=False)


Reply