I am writing Python code to check whether there is a fuzzy match between two strings. If there is a match, I have to store both strings and the average match value. The strings to be compared come from lists that run into thousands of entries. The issue is that the code takes too long to execute. To speed it up, I looked at the other answers here, but none of them had multiple return values from the inner function in the loop. I am looking for optimized code.

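from fuzzywuzzy import fuzz  # assumed import; the post does not show where fuzz comes from

def get_match(token, chk_str):
    # Average of the four ratios, each in the range 0-100
    avg_value = (fuzz.ratio(token, chk_str)
                 + fuzz.partial_ratio(token, chk_str)
                 + fuzz.token_sort_ratio(token, chk_str)
                 + fuzz.token_set_ratio(token, chk_str)) / 4
    if int(avg_value) > 70:
        return token, chk_str, int(avg_value)
    else:
        return 0, 0, 0  # sentinel meaning "no match"

tokens = ['abc', 'bcd', 'abe', 'efg', 'opq']
valid_list = ['acb', 'abc', 'abf', 'bcd', 'rts', 'xyz']
potential_entry, match_tokens, ag_match = [], [], []

# Compare every token against every valid entry
for i in tokens:
    for j in valid_list:
        token, valid_entry, avg_match = get_match(i, j)
        if token != 0:
            potential_entry.append(valid_entry)
            match_tokens.append(token)
            ag_match.append(avg_match)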
  • Yes, please. I want to check each token in the input against each token in valid_list. Commented Nov 22, 2019 at 10:35

1 Answer

The most obvious improvement I can see is to short-circuit out of the fuzzy checks as soon as a pair is clearly not going to be a valid match.

So instead of computing all four ratios in one line, compute them individually and check each against a threshold before computing the next one. Start with the ratio you'd expect to give the clearest answer.
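
One way to sketch this short-circuit, without changing which pairs count as a match, is to stop as soon as the average can no longer exceed the threshold even if every remaining ratio came back as 100. The `simple_ratio` scorer below is only a stand-in built on the standard library's `difflib` so that the snippet runs on its own; with fuzzywuzzy you would presumably pass `fuzz.ratio`, `fuzz.partial_ratio` and so on instead. The `scorers` and `threshold` parameters are names chosen for this illustration, and the float average is compared directly rather than cast first.

```python
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer returning 0-100 (an assumption, not fuzzywuzzy itself)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def get_match(token, candidate, scorers, threshold=70):
    """Return (token, candidate, avg) on a match, else None.

    Assumes scorers is a non-empty list of functions returning 0-100.
    """
    total = 0.0
    for done, scorer in enumerate(scorers, start=1):
        total += scorer(token, candidate)
        remaining = len(scorers) - done
        # Best average still reachable if every remaining scorer returned 100.
        best_possible = (total + 100 * remaining) / len(scorers)
        if best_possible <= threshold:
            return None  # the average cannot exceed the threshold: stop early
    # Reaching this point means the full average is above the threshold.
    return token, candidate, int(total / len(scorers))
```

Ordering `scorers` so that the cheapest or most discriminating ratio comes first should let the early exit trigger sooner. Returning `None` instead of `0, 0, 0` is a design choice of this sketch; the caller then checks `if result is not None`.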

Also, consider:

  • using a single list of objects to avoid having to append to three lists
  • using sets for your tokens and valid list to ensure no duplicate checks are done
  • not casting avg_value to an integer for the if statement; it doesn't really make a difference here
  • adding an explicit i == j check that returns a 100% ratio before doing any other checks
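
The points above could be combined roughly as follows. This is a sketch rather than a drop-in replacement: `Match`, `score` and `find_matches` are hypothetical names, and `score` uses the standard library's `difflib` as a stand-in for the averaged fuzzywuzzy ratios so that the snippet runs on its own.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class Match(NamedTuple):
    """One record per match, replacing the three parallel lists."""
    token: str
    valid_entry: str
    avg_match: int


def score(a, b):
    """Stand-in for the averaged ratios (0-100); an assumption of this sketch."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def find_matches(tokens, valid_list, threshold=70):
    # Deduplicate both inputs once; sorting only keeps the output order stable.
    unique_tokens = sorted(set(tokens))
    unique_valid = sorted(set(valid_list))
    matches = []
    for token in unique_tokens:
        for entry in unique_valid:
            if token == entry:
                # Identical strings: skip the fuzzy scoring entirely.
                matches.append(Match(token, entry, 100))
                continue
            avg = score(token, entry)
            if avg > threshold:  # compare the float directly, no cast needed
                matches.append(Match(token, entry, int(avg)))
    return matches
```

Each result can then be read by field name, for example `m.token` and `m.avg_match`, instead of indexing into three separate lists.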

1 Comment

Thanks @Sayse for the recommendations. I have removed the duplicates from both lists. When I did not cast the value, I was getting a float value error. I have also removed the tokens that already match exactly. The code I posted here is just an example, which is why you see the exact-match case. I will try to implement the list-of-objects suggestion.
