
I have 2 different DataFrames for which I am trying to match string columns (names).

Below are just some samples of the DataFrames:

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

The aim is to create another DF3 as shown below

Code     NAME    Best Match   Score
150      Marc    Maarc        0.90
250      Kikc    Kirc         0.75

The following code gives the expected output

import difflib
from functools import partial

# For each name in df2, find the single closest name in df1
f = partial(difflib.get_close_matches, possibilities=df1['Name'].tolist(), n=1)

matches = df2['NAME'].map(f).str[0].fillna('')

# Similarity ratio between each best match and the original name
scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]

df3 = df2.assign(best=matches, score=scores)
df3 = df3.sort_values(by='score')

The Problem

Matching the strings for only 2 rows takes around 30 seconds. This task has to be done for 1K rows, which will take hours!

The Question

How can I speed up the code? I was thinking about something like fetchall?

EDIT

The fuzzywuzzy library has also been tried; it takes even longer than difflib with the following code:

from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    # Score every name in df against the query string w
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()  # index label of the best score
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}

# Attach the best match found in df1 to each row of df2
df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
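fuzzywuzzy's scorers run in pure Python, which is most of the cost here. A minimal sketch, assuming the third-party rapidfuzz package is acceptable (it is not tried in the question): it mirrors fuzzywuzzy's fuzz API but is implemented in C++, so the same apply-based loop runs far faster.

# Hedged sketch assuming rapidfuzz, a C++-backed drop-in for fuzzywuzzy's
# fuzz module; swapping the import keeps get_fuzz() above unchanged.
from rapidfuzz import fuzz   # instead of: from fuzzywuzzy import fuzz

print(fuzz.token_set_ratio('Maarc', 'Marc'))   # 0-100 scale, same as fuzzywuzzy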
  • Unfortunately I don't think difflib is the right tool for this task; it's not that fast.
  • Maybe you can try building a distance matrix or something like that using the sklearn module. For your case the Levenshtein distance may be interesting (see the sketch after these comments).
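Following the comment's idea of computing all pairwise similarities in optimized code, here is a minimal sketch assuming the rapidfuzz package (my suggestion; neither rapidfuzz nor this code appears in the thread). process.extractOne scans every candidate in C++ and can replace the whole difflib loop:

# Hedged sketch: best df1 match per df2 name via rapidfuzz (assumed available)
from rapidfuzz import process, fuzz

choices = df1['Name'].tolist()

# extractOne returns a (best_match, score, index) tuple; with no score_cutoff
# it always returns the best candidate for a non-empty choices list
results = [process.extractOne(name, choices, scorer=fuzz.ratio)
           for name in df2['NAME']]

df3 = df2.assign(best=[r[0] for r in results],
                 score=[r[1] / 100 for r in results])  # rescale 0-100 -> 0-1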

2 Answers


I was able to speed up the matching step by using the postal code column as a discriminant, going from 1h40 to 7min of computation.

(The sample DataFrames are the same as in the question.)

Below is the code that matches on the Name column and retrieves the name with the best score:

%%time
import difflib
from functools import partial

import numpy as np

def difflib_match(df1, df2):

    # Pre-fill the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan

    # Unique postal codes of df2, used as the blocking key
    postal_codes = df2['POSTAL_CODE'].unique()

    # Loop over each postal code and match only names whose rows share it
    for m, code in enumerate(postal_codes):

        # Print progress every 100 unique postal codes
        if m % 100 == 0:
            print(m, 'of', len(postal_codes))

        df1_block = df1[df1['PostalCode'] == code]
        df2_block = df2[df2['POSTAL_CODE'] == code]

        # Closest df1 name (within this postal code) for each df2 name
        f = partial(difflib.get_close_matches, possibilities=df1_block['Name'].tolist(), n=1)
        matches = df2_block['NAME'].map(f).str[0].fillna('')

        # Similarity ratio for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_block['NAME'])]

        # Write the results back into df2
        for i, name in enumerate(df2_block['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)

    return df2

# Apply function
df_diff = difflib_match(df1, df2)

# Display DF
print('Shape: ', df_diff.shape)
df_diff.head()
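The per-name write-back with where() is the remaining bottleneck. Below is a minimal alternative sketch of the same postal-code blocking idea, rewritten with groupby (my rewrite, not the answer's code), so each block is matched and assigned in one pass:

# Hedged sketch (not the answer's code): same blocking by postal code,
# expressed with groupby; row order is restored with sort_index at the end.
import difflib
import pandas as pd

def match_by_postal(df1, df2):
    pieces = []
    blocks1 = {code: g for code, g in df1.groupby('PostalCode')}
    for code, block2 in df2.groupby('POSTAL_CODE'):
        block1 = blocks1.get(code)
        if block1 is None:
            # No candidate names share this postal code
            pieces.append(block2.assign(best='', score=0.0))
            continue
        choices = block1['Name'].tolist()
        # Empty string when get_close_matches finds nothing above its cutoff
        best = [next(iter(difflib.get_close_matches(name, choices, n=1)), '')
                for name in block2['NAME']]
        score = [difflib.SequenceMatcher(None, b, n).ratio()
                 for b, n in zip(best, block2['NAME'])]
        pieces.append(block2.assign(best=best, score=score))
    return pd.concat(pieces).sort_index()

df_diff = match_by_postal(df1, df2)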

The fastest way I can think of to match strings is using regex.

It's a search language designed to find matches in a string.

You can see an example here:

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# x is a re.Match object (truthy) because the pattern matches; otherwise None
print(bool(x))   # True

*Taken from: https://www.w3schools.com/python/python_regex.asp

Since I don't know anything about DataFrames, I don't know how to implement regex in your code, but I hope regex might help you.
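For completeness, here is a minimal sketch of applying regex to a DataFrame column with pandas (my illustration, not from the answer). Note that regex tests a pattern; it does not produce a fuzzy similarity score, so it only helps when the name variations can be written as a pattern:

# Hedged sketch: vectorized regex matching on a pandas string column
import pandas as pd

df = pd.DataFrame({'NAME': ['Marc', 'Kikc', 'Mosc']})

# Boolean mask of names starting with 'M' and ending with 'c'
mask = df['NAME'].str.contains(r'^M.*c$', regex=True)
print(df[mask])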

2 Comments

  • Why do you think regex will be faster than literal matching?
  • It's up for debate, but it usually depends on the complexity of the match and how well you can write your regex, as you can see in the following links: stackoverflow.com/questions/16638637/… blog.codinghorror.com/regex-performance
