
I have 2 different DataFrames for which I am trying to match string columns (names).

Below are just some samples of the DataFrames:

df1 (127000,3)
Code     Name     PostalCode
150      Maarc    47111
250      Kirc     41111
170      Moic     42111
140      Nirc     44111
550      Lacter   47111

df2 (38000,3)
Code     NAME     POSTAL_CODE
150      Marc     47111
250      Kikc     41111
170      Mosc     49111
140      NiKc     44111
550      Lacter   47111

The aim is to create another DF3 as shown below

Code     NAME    Best Match   Score
150      Marc    Maarc        0.90
250      Kikc    Kirc         0.75

The following code gives the expected output

import difflib
from functools import partial

# For each name in df2, find the single closest name in df1
f = partial(difflib.get_close_matches, possibilities=df1['Name'].tolist(), n=1)

matches = df2['NAME'].map(f).str[0].fillna('')

# Similarity ratio between each best match and the original name
scores = [difflib.SequenceMatcher(None, x, y).ratio()
          for x, y in zip(matches, df2['NAME'])]

df3 = df2.assign(best=matches, score=scores)
df3 = df3.sort_values(by='score')

The Problem

Matching the strings for only 2 rows takes around 30 seconds. This task has to be done for 1K rows, which will take hours!

The Question

How can I speed up the code? I was thinking about something like fetchall?

EDIT

The fuzzywuzzy library has also been tried; it takes even longer than difflib with the following code:

from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    # Score every name in df against the query string w
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    idx = s.idxmax()  # index label of the best score
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}

# Attach the best match found in df1 to each row of df2
df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
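fuzzywuzzy's scorers run in pure Python, which is most of the cost here. A minimal sketch, assuming the third-party rapidfuzz package is acceptable (it is not tried in the question): it mirrors fuzzywuzzy's fuzz API but is implemented in C++, so the same apply-based loop runs far faster.

# Hedged sketch assuming rapidfuzz, a C++-backed drop-in for fuzzywuzzy's
# fuzz module; swapping the import keeps get_fuzz() above unchanged.
from rapidfuzz import fuzz   # instead of: from fuzzywuzzy import fuzz

print(fuzz.token_set_ratio('Maarc', 'Marc'))   # 0-100 scale, same as fuzzywuzzy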
  • Unfortunately I don't think difflib is the right tool for this task; it's not that fast.
  • Maybe you can try building a distance matrix or something like that using the sklearn module. For your case the Levenshtein distance may be interesting (see the sketch after these comments).
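Following the comment's idea of computing all pairwise similarities in optimized code, here is a minimal sketch assuming the rapidfuzz package (my suggestion; neither rapidfuzz nor this code appears in the thread). process.extractOne scans every candidate in C++ and can replace the whole difflib loop:

# Hedged sketch: best df1 match per df2 name via rapidfuzz (assumed available)
from rapidfuzz import process, fuzz

choices = df1['Name'].tolist()

# extractOne returns a (best_match, score, index) tuple; with no score_cutoff
# it always returns the best candidate for a non-empty choices list
results = [process.extractOne(name, choices, scorer=fuzz.ratio)
           for name in df2['NAME']]

df3 = df2.assign(best=[r[0] for r in results],
                 score=[r[1] / 100 for r in results])  # rescale 0-100 -> 0-1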

2 Answers


I was able to speed up the matching step by using the postal code column as a discriminant, going from 1h40 to 7min of computation.

(The sample DataFrames are the same as in the question.)

Below is the code that matches on the Name column and retrieves the name with the best score:

%%time
import difflib
from functools import partial

import numpy as np

def difflib_match(df1, df2):

    # Pre-fill the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan

    # Unique postal codes of df2, used as the blocking key
    postal_codes = df2['POSTAL_CODE'].unique()

    # Loop over each postal code and match only names whose rows share it
    for m, code in enumerate(postal_codes):

        # Print progress every 100 unique postal codes
        if m % 100 == 0:
            print(m, 'of', len(postal_codes))

        df1_block = df1[df1['PostalCode'] == code]
        df2_block = df2[df2['POSTAL_CODE'] == code]

        # Closest df1 name (within this postal code) for each df2 name
        f = partial(difflib.get_close_matches, possibilities=df1_block['Name'].tolist(), n=1)
        matches = df2_block['NAME'].map(f).str[0].fillna('')

        # Similarity ratio for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_block['NAME'])]

        # Write the results back into df2
        for i, name in enumerate(df2_block['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)

    return df2

# Apply function
df_diff = difflib_match(df1, df2)

# Display DF
print('Shape: ', df_diff.shape)
df_diff.head()
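The per-name write-back with where() is the remaining bottleneck. Below is a minimal alternative sketch of the same postal-code blocking idea, rewritten with groupby (my rewrite, not the answer's code), so each block is matched and assigned in one pass:

# Hedged sketch (not the answer's code): same blocking by postal code,
# expressed with groupby; row order is restored with sort_index at the end.
import difflib
import pandas as pd

def match_by_postal(df1, df2):
    pieces = []
    blocks1 = {code: g for code, g in df1.groupby('PostalCode')}
    for code, block2 in df2.groupby('POSTAL_CODE'):
        block1 = blocks1.get(code)
        if block1 is None:
            # No candidate names share this postal code
            pieces.append(block2.assign(best='', score=0.0))
            continue
        choices = block1['Name'].tolist()
        # Empty string when get_close_matches finds nothing above its cutoff
        best = [next(iter(difflib.get_close_matches(name, choices, n=1)), '')
                for name in block2['NAME']]
        score = [difflib.SequenceMatcher(None, b, n).ratio()
                 for b, n in zip(best, block2['NAME'])]
        pieces.append(block2.assign(best=best, score=score))
    return pd.concat(pieces).sort_index()

df_diff = match_by_postal(df1, df2)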

The fastest way I can think of to match strings is using regex.

It's a search language designed to find matches in a string.

You can see an example here:

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

# x is a re.Match object (truthy) because the pattern matches; otherwise None
print(bool(x))   # True

*Taken from: https://www.w3schools.com/python/python_regex.asp

Since I don't know anything about DataFrames, I don't know how to implement regex in your code, but I hope regex might help you.
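For completeness, here is a minimal sketch of applying regex to a DataFrame column with pandas (my illustration, not from the answer). Note that regex tests a pattern; it does not produce a fuzzy similarity score, so it only helps when the name variations can be written as a pattern:

# Hedged sketch: vectorized regex matching on a pandas string column
import pandas as pd

df = pd.DataFrame({'NAME': ['Marc', 'Kikc', 'Mosc']})

# Boolean mask of names starting with 'M' and ending with 'c'
mask = df['NAME'].str.contains(r'^M.*c$', regex=True)
print(df[mask])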

2 Comments

  • Why do you think regex will be faster than literal matching?
  • It's up for debate, but it usually depends on the complexity of the match and how well you can write your regex, as you can see in the following links: stackoverflow.com/questions/16638637/… blog.codinghorror.com/regex-performance
