I have two different DataFrames whose string columns (names) I am trying to match. Below are small samples of both DataFrames.
df1 (127000, 3)

Code  Name    PostalCode
150   Maarc   47111
250   Kirc    41111
170   Moic    42111
140   Nirc    44111
550   Lacter  47111
df2 (38000, 3)

Code  NAME    POSTAL_CODE
150   Marc    47111
250   Kikc    41111
170   Mosc    49111
140   NiKc    44111
550   Lacter  47111
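For reference, minimal stand-ins for the samples above can be built like this (a sketch only; the real frames have 127K and 38K rows):

```python
import pandas as pd

# Small stand-ins for df1 (127000, 3) and df2 (38000, 3)
df1 = pd.DataFrame({
    'Code': [150, 250, 170, 140, 550],
    'Name': ['Maarc', 'Kirc', 'Moic', 'Nirc', 'Lacter'],
    'PostalCode': [47111, 41111, 42111, 44111, 47111],
})
df2 = pd.DataFrame({
    'Code': [150, 250, 170, 140, 550],
    'NAME': ['Marc', 'Kikc', 'Mosc', 'NiKc', 'Lacter'],
    'POSTAL_CODE': [47111, 41111, 49111, 44111, 47111],
})
```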
The aim is to create another DataFrame, df3, as shown below (scores are illustrative):

Code  NAME  Best Match  Score
150   Marc  Maarc       0.9
250   Kikc  Kirc        0.9
The following code gives the expected output
import difflib
from functools import partial

f = partial(difflib.get_close_matches, possibilities=df1['Name'].tolist(), n=1)
matches = df2['NAME'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]
df3 = df2.assign(best=matches, score=scores)
df3 = df3.sort_values(by='score')  # sort_values returns a copy, so reassign
The problem
Matching the strings for just 2 rows takes around 30 seconds, and this task has to be done for 1K rows, which would take hours!
The Question
How can I speed up the code? I was thinking about something like fetchall?
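One stdlib-only idea (a sketch, not a benchmark): if names repeat in df2, memoize the lookup so each distinct name is scanned against df1 only once. `possibilities` below is a small stand-in for `df1['Name'].tolist()`:

```python
import difflib
from functools import lru_cache

# Stand-in for df1['Name'].tolist(); the real list has ~127K entries
possibilities = ['Maarc', 'Kirc', 'Moic', 'Nirc', 'Lacter']

@lru_cache(maxsize=None)
def best_match(name):
    # get_close_matches scans every possibility; the cache means a
    # name that repeats in df2 is only scanned once
    hits = difflib.get_close_matches(name, possibilities, n=1)
    return hits[0] if hits else ''

matches = [best_match(n) for n in ['Marc', 'Kikc', 'Marc']]
```

With `df2['NAME'].map(best_match)` the cache then replaces the repeated full scans the `partial` version performs for every row.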
EDIT
I have even tried the fuzzywuzzy library, which takes longer than difflib, using the following code:
from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()  # idxmax returns an index label, so look up with .loc
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}

df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
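Another difflib-level trick worth trying (a sketch; `candidates` stands in for `df1['Name'].tolist()`): `SequenceMatcher` precomputes statistics about its second sequence, so keeping the query fixed as seq2 and swapping only seq1 avoids re-preparing the query on every comparison:

```python
import difflib

def best_match(query, candidates):
    # SequenceMatcher caches data about seq2, so put the query there
    # once and vary only seq1 across the candidate names
    sm = difflib.SequenceMatcher(None, '', query)
    best, best_score = '', 0.0
    for cand in candidates:
        sm.set_seq1(cand)
        score = sm.ratio()
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

name, score = best_match('Marc', ['Maarc', 'Kirc', 'Moic', 'Nirc', 'Lacter'])
```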
I was also pointed to the sklearn module; for this case the Levenshtein distance may be interesting.