
I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]

matching_strings(l)

# Output
# [["moose","mouse"],["rat", "cat"]]

I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.

UPDATE:

Thanks to Collapsar's help, here is a solution in Python

import functools
import numpy

# Map each lowercase letter to its position in the alphabet (a -> 0, ..., z -> 25).
alphabet = {letter: index for index, letter in enumerate('abcdefghijklmnopqrstuvwxyz')}

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist = []

# Build a 26-component letter-frequency vector for each string.
for string in l:
    fv = [0] * 26
    for letter in string:
        fv[alphabet[letter]] += 1
    fvlist.append(fv)

# Sort using the comparison suggested in the comments below:
# the sign of the summed component differences.
fvlist.sort(key=functools.cmp_to_key(lambda fv1, fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))

However, the sorted vectors are returned in the following order:

"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"

I would consider this sub-optimal, because I would want "moose" and "mouse" to end up next to each other. I understand that however I sort these words, there is no way to put every word next to all of its closest pairs. However, I am still open to alternative solutions.

2 Comments
  • This is an excellent question, and I have read the answers below. There is a lot of variety and many creative ideas. I would like to know if there is a way to find the 2 closest strings in a list of 10,000 strings, where the measure of "distance" is not built into the search algorithm but can be injected (Levenshtein or other). Can it be done in less than O(n^2)? Commented Dec 13, 2019 at 17:13
  • This problem could be solved efficiently using a metric tree. Commented Apr 20, 2022 at 21:14
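
As an aside, here is a minimal sketch of the metric-tree suggestion in the comment above, written as a BK-tree with a plain Levenshtein function injected as the metric (the class and helper names are my own, not from any particular library):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (any metric obeying the triangle inequality works here).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))       # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                                    # (word, {dist: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, max_dist):
        found, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= max_dist:
                found.append((d, node[0]))
            # The triangle inequality prunes children outside [d - max_dist, d + max_dist].
            for child_d, child in node[1].items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return found

tree = BKTree(levenshtein)
for w in ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]:
    tree.add(w)
print(tree.search("moose", 1))   # [(0, 'moose'), (1, 'mouse')] in some order

Finding all close pairs is then one query per string, which is usually far fewer comparisons than the full pairwise sweep, although the worst case still depends on the data.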

3 Answers


One way to do that (with complexity O(n k^2), where n is the number of strings and k is the length of the longest string) is to convert every string into a set of masks like this:

rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?

This way, if two words differ in one letter, like 'rat' and 'cat', they will both have the mask ?at among others, while if one word is the other with one extra letter, like 'rat' and 'rats', they will both have the mask 'rat?'.

Then you just group strings based on their masks and print the groups that contain two or more strings. You might want to dedupe your array first if it has duplicates.

Here's some example code, with an extra "cats" string in it.

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}

def add(mask, s):
    if mask not in d:
        d[mask] = []
    d[mask].append(s)

for s in l:
    for pos in range(len(s)):
        add(s[:pos] + '?' + s[pos + 1:], s)   # the letter at pos replaced by '?'
        add(s[:pos] + '?' + s[pos:], s)       # '?' inserted before position pos
    add(s + '?', s)                           # '?' appended at the end

for k, v in d.items():
    if len(v) > 1:
        print(v)

Outputs

['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']

3 Comments

This is great, thanks. I know I didn't specify this in the original question, but is there any way to detect swapped letters with this method? For instance, having "wasp" and "waps" get paired to each other. Either way, this might be what I end up using
Are you only looking at swapping adjacent letters, or should wasp and pasw also match? If you are only looking at adjacent letters, just populate a dictionary of all the strings with every pair of adjacent letters swapped, and then iterate over your strings again and query the dictionary. Or is it too slow?
Instead of that add function that deals with the case of a missing key, you could also use a defaultdict(list), and then simply call l = d[mask] and l.append multiple times.
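
A rough sketch of the adjacent-swap suggestion above, combined with the defaultdict tip (the keying scheme here is my own illustration, not part of the answer):

from collections import defaultdict

l = ["wasp", "waps", "moose", "mouse"]
d = defaultdict(list)

for s in l:
    d[s].append(s)                                  # the string itself
    for pos in range(len(s) - 1):                   # every adjacent pair swapped
        d[s[:pos] + s[pos + 1] + s[pos] + s[pos + 2:]].append(s)

pairs = set()
for group in d.values():
    words = sorted(set(group))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            pairs.add((words[i], words[j]))

print(pairs)   # {('waps', 'wasp')} -- only swaps; the '?' masks handle the rest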

First, you must index your list with some fuzzy-search indexing.

Second, iterate over your list and find neighbors via quick lookups in the pre-built index.

About fuzzy indexing:

Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and the modification is named the "Wilbur-Khovayko algorithm".

The basic idea: split strings into trigrams and search for the maximal intersection scores.

For example, take the string "hello world". It generates the trigrams hel, ell, llo, "lo ", "o_w", and so on; it also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.

Thereafter, an index is built for each trigram, recording the terms in which it is present.

So, there is a list of term_IDs for each trigram.

When the user submits a string, it is also split into trigrams; the program searches for the maximal intersection scores and generates an N-sized result list.

It works quickly: I remember that, on an old Sun/Solaris (256 MB RAM, 200 MHz CPU), it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.

You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
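
For readers who want to stay in Python, here is a rough sketch of the trigram-overlap idea described above; it is not the linked engine, and the padding and scoring details are only illustrative:

from collections import defaultdict

def trigrams(s):
    # Pad the string so prefix/suffix trigrams such as '$he' and 'lo$' are produced.
    padded = '$' + s + '$'
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(terms):
    index = defaultdict(set)                  # trigram -> set of term ids
    for term_id, term in enumerate(terms):
        for tri in trigrams(term):
            index[tri].add(term_id)
    return index

def closest(query, terms, index, n=3):
    scores = defaultdict(int)                 # term id -> number of shared trigrams
    for tri in trigrams(query):
        for term_id in index.get(tri, ()):
            scores[term_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [terms[i] for i in ranked[:n]]

terms = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
index = build_index(terms)
print(closest("moos", terms, index))          # ['moose', 'mouse']

Raw intersection counts favour longer strings, so a real implementation would likely normalise the score by the number of trigrams.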

2 Comments

Thanks for your help. I don't know any C, but I was able to find a Python module that does trigram matching. The module is called Fuzzyset and it was able to handle my dataset of ~13000 words in about 5 minutes. Not ideal, but acceptable. Thanks for pointing me in the right direction.
You're welcome, and thanks for appreciating my efforts. Also, if you need to increase the performance of your code, you can use RADIX50 encoding to convert each trigram to a 16-bit value. That value can be used directly as an index into a table containing the list of termIDs. This technique is used in my fuzzy engine. Details here: en.wikipedia.org/wiki/DEC_Radix-50
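
To illustrate the RADIX50 remark: with an alphabet of at most 40 symbols, a trigram packs into a value below 40^3 = 64000, which fits in 16 bits and can index a table directly. A toy version (with my own symbol table, not DEC's exact encoding) could look like this:

# Toy RADIX-50-style packing: at most 40 symbols, so a trigram fits in 16 bits.
SYMBOLS = " $abcdefghijklmnopqrstuvwxyz0123456789"        # 38 symbols used here
CODE = {c: i for i, c in enumerate(SYMBOLS)}

def pack_trigram(tri):
    a, b, c = (CODE[ch] for ch in tri)
    return (a * 40 + b) * 40 + c                           # always < 40**3 = 64000 < 2**16

print(pack_trigram("$he"))                                 # small integer, usable as a table index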

The naive implementation amounts to setting up a boolean matrix indexed by the strings (i.e. their positions in the sorted list) and comparing each pair of strings, setting the corresponding matrix element to true iff the strings are 'similar' with respect to your criterion. This runs in O(n^2).

You might be better off by transforming your strings into tuples of character frequencies ( e.g. 'moose' -> (0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0) where the i-th vector component represents the i-th letter in the alphabet). Note that the frequency vectors will differ in 'few' components only ( e.g. for D-L distance 1 in at most 2 components, the respective differences being +1,-1 ).

Sort your transformed data. Candidates for the pairs you wish to generate will be adjacent or at least 'close' to each other in your sorted list of transformed values. You check the candidates by comparing each list entry with at most k of its successors in the list, k being a small integer (actually comparing the corresponding strings, of course). This algorithm will run in O(n log n).
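
Here is a rough sketch of this sort-then-scan idea; the sort key (plain lexicographic order on the frequency tuples), the window size k, and the direct distance-1 check are illustrative choices of mine, not prescriptions from the answer:

def freq_vector(word):
    # 26-component letter-frequency vector, as described above.
    fv = [0] * 26
    for ch in word:
        fv[ord(ch) - ord('a')] += 1
    return tuple(fv)

def within_one_edit(a, b):
    # Damerau-Levenshtein distance <= 1, checked directly.
    if a == b:
        return True
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:                                # one substitution
            return True
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:   # adjacent transposition
            i, j = diffs
            return a[i] == b[j] and a[j] == b[i]
        return False
    if abs(len(a) - len(b)) == 1:                          # one insertion/deletion
        shorter, longer = sorted((a, b), key=len)
        return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))
    return False

words = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
k = 3                                                      # how many successors to check
ordered = sorted(words, key=freq_vector)
pairs = [[w, v]
         for i, w in enumerate(ordered)
         for v in ordered[i + 1:i + 1 + k]
         if within_one_edit(w, v)]
print(pairs)                                               # [['mouse', 'moose'], ['rat', 'cat']]

With a small k, pairs whose vectors sort more than k positions apart are missed, so k trades speed against completeness.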

You have to trade off between the added overhead of transformation / sorting (with complex comparison operations depending on the representation you choose for the frequency vectors ) and the reduced number of comparisons. The method does not consider the intra-word position of characters but only their occurrence. Depending on the actual set of strings there'll be many candidate pairs that do not turn into actually 'similar' pairs.

As your data appears to consist of English lexemes, a potential optimisation would be to define character classes ( e.g. vowels/consonants, 'e'/other vowels/syllabic consonants/non-syllabic consonants ) instead of individual characters.
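
For instance, a coarser transformation along those lines might look like this (the particular classes are only an illustration):

VOWELS = set('aeiou')

def class_vector(word):
    # Count letters per coarse class instead of per individual letter.
    vowels = sum(ch in VOWELS for ch in word)
    return (vowels, len(word) - vowels)                    # (vowel count, consonant count)

print(class_vector('moose'), class_vector('mouse'))        # (3, 2) (3, 2)

Similar words then collide or sort together more often, at the cost of more false candidates to check.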

Additional optimisation:

Note that precisely those pairs of strings in your data set that are permutations of each other (e.g. [art, tar]) will produce identical values under the given transformation. So if you limit yourself to a D-L distance of 1, and if you do not count the transposition of adjacent characters as a single edit step, you never need to pick list items with identical transformation values as candidates.

8 Comments

Thank you a ton for your help, this is pretty much what I was looking for. This may be a silly question, but could you go into detail about the best way to sort the tuples of character frequencies?
You specify a comparison function (cmp) for your programming environment's generic sort routine. cmp may be the sum of absolute component differences. In general, you need some cmp that ignores the component identity.
What criteria should I be sorting the character frequencies by? Grouping them by number of as, then number of bs, then number of cs etc. seems arbitrary and uninformative, but nothing else springs to mind
Example (Python): use numpy.sign, numpy.sum, numpy.subtract to compare two feature vectors when sorting: fvlist.sort ( key=functools.cmp_to_key ( lambda fv1,fv2: return numpy.sign(numpy.sum(numpy.subtract(fv1, fv2))) ) );. Caveat: I'm not a python programmer so this codesnippet may be syntactically incorrect.
Thank you so much for your help, and sorry for all the questions. I'm not sure if this is going to work, though, because when I ran that code it sorted the example words as "rat, cat, lion, fish, moose, tiger, mouse". Rat and cat are next to each other, which is good, but moose and mouse are not. It should still help, though, because I guess I can compare each word to its four nearest neighbours, or something like that, instead of comparing each word to all other words
