I'm comparing two dataframe columns in Python, with the goal of finding, for each element of the first column, the best match in the second one. The first column contains 19,000 rows, and for every string in it I need to find the best match in the second column. In practice this means comparing each of the 19,000 rows against all 19,000 rows, with the constraint that the match has to be a different entry, not the string itself.

I started with a simple comparison, finding the best match for a single string in a list, and that succeeded. Then I passed a whole list as the query to compare both lists, but that raises "TypeError: expected string or bytes-like object", because a list is passed where a string is expected. Finally, I tried a loop, but the error is the same. Is there a way to build a list with the expected results? Maybe there is a better way to do it with another library, but so far I have found nothing. Here is the code at the moment:

#simple example
from fuzzywuzzy import process
string = "appl"
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(string,compare)
print(Ratios)
[('apple', 89), ('asple', 67), ('tab', 29), ('adfad.', 22)]

highest = process.extractOne(string,compare)
print(highest)
('apple', 89)

#data frame
from fuzzywuzzy import process
dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
Ratios = process.extract(dataframecolumn,compare)
TypeError: expected string or bytes-like object

#expected (but I need a list)
highest = process.extractOne(dataframecolumn[0],compare)
print(highest)
('apple', 89)
highest = process.extractOne(dataframecolumn[1],compare)
print(highest)
('tab', 80)

#Result expected
results = ["apple, 89","tab, 80"]

#Error
myl = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
results = []
for x in myl:
    results.append(process.extractOne(myl,compare)[1])
TypeError: expected string or bytes-like object

1 Answer

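from operator import itemgetter
from fuzzywuzzy import process

dataframecolumn = ["appl","tb"]
compare = ["adfad.","apple","asple","tab"]
# One list of (match, score) candidates per query string
Ratios = [process.extract(x,compare) for x in dataframecolumn]
# Keep the highest-scoring candidate for each query
print([max(ratios, key=itemgetter(1)) for ratios in Ratios])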

# Or oneliner
#Ratios = [max(process.extract(x,compare),key = itemgetter(1)) for x in dataframecolumn]

Since extract returns its results sorted by score, highest first (as the output in the question shows), the call to max can be avoided by taking the first element:

Ratios = [process.extract(x, compare)[0] for x in dataframecolumn]

Output:

[('apple', 89), ('tab', 80)]
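
The question asked for plain strings such as "apple, 89" rather than tuples. A minimal sketch of that conversion, assuming the list of (match, score) tuples shown in the output above; the names matches and results are only illustrative:

```python
# (match, score) tuples, copied from the output above
matches = [("apple", 89), ("tab", 80)]

# Join each tuple into a single "match, score" string
results = [f"{name}, {score}" for name, score in matches]
print(results)
```

Keeping the tuples is usually more convenient if the scores are needed later, for example to filter by a threshold.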

If you want to skip exact matches and only get fuzzy matches, skip the matches that have a score of 100 and take the first one below 100. Since the results are already sorted, that is the best non-exact match.

dataframecolumn = ["apple","tb"]
compare = ["adfad","apple","asple","tab"]
Ratios = [process.extract(x,compare) for x in dataframecolumn]
result = list()
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print (result) 
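
Filtering on a score of 100 also throws away genuine duplicates, which is a problem when a column is compared with itself. One possible alternative is to skip the row by position instead of by score. The sketch below is an assumption-laden illustration, not part of fuzzywuzzy: best_other_match and similarity are hypothetical helper names, and the scorer is the standard library's difflib.SequenceMatcher used as a stand-in so the example runs without extra packages. Its scores will differ from fuzzywuzzy's default scorer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score two strings from 0 to 100 (stand-in scorer, not fuzzywuzzy's)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_other_match(values, scorer=similarity):
    """For each string, return (best_match, score) among the other positions."""
    results = []
    for i, query in enumerate(values):
        best = None
        for j, candidate in enumerate(values):
            if i == j:  # skip the row itself, but keep duplicates in other rows
                continue
            score = scorer(query, candidate)
            if best is None or score > best[1]:
                best = (candidate, score)
        results.append(best)  # stays None if there is no other row
    return results


column = ["apple", "apple", "asple", "tab"]
print(best_other_match(column))
```

Any function that takes two strings and returns a number can be passed as scorer. Note that this compares every pair, so for 19,000 rows it performs roughly 361 million comparisons and may be slow.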

7 Comments

What if I want to get the second result? The idea is to compare the same column. For example, dataframecolumn = ["apple","tb"], compare = ["adfad.","apple","asple","tab"] should give "asple".
Ratios = [process.extract(x, compare)[1] for x in dataframecolumn] Output: [('asple', 67), ('adfad.', 0)], but should be: [('asple', 67), ('tab', 80)]
@ecp in case you want to skip exact matches just skip 100% scores. Check the update.
In some cases the solution does not give all results. For example, from fuzzywuzzy import process dataframecolumn = ["apple","tb"] compare = ["apple","apple"] Ratios = [process.extract(x,compare) for x in dataframecolumn] result = list() for ratio in Ratios: for match in ratio: if match[1] != 100: result.append(match) break print (result)
When there is a duplicate, it should return "apple" (the 2nd one) in the given example. How can I achieve this? Thanks again!