
I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]

matching_strings(l)

# Output
# [["moose","mouse"],["rat", "cat"]]

I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.

UPDATE:

Thanks to Collapsar's help, here is a solution in Python

import functools
import numpy

# Map each lowercase letter to its position in the alphabet (a -> 0, ..., z -> 25).
alphabet = {letter: index for index, letter in enumerate('abcdefghijklmnopqrstuvwxyz')}

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist = []

# Build a 26-component letter-frequency vector for each string.
for string in l:
    fv = [0] * 26
    for letter in string:
        fv[alphabet[letter]] += 1
    fvlist.append(fv)

# Sort using the comparison suggested in the comments below:
# the sign of the summed component differences.
fvlist.sort(key=functools.cmp_to_key(lambda fv1, fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))

However, the sorted vectors are returned in the following order:

"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"

I would consider this sub-optimal, because I would want "moose" and "mouse" to end up next to each other. I understand that however I sort these words, there is no way to put every word next to all of its closest pairs. However, I am still open to alternative solutions.

2 Comments
  • This is an excellent question, and I have read the answers below. There is a lot of variety and many creative ideas. I would like to know if there is a way to find the 2 closest strings in a list of 10,000 strings, where the measure of "distance" is not built into the search algorithm but can be injected (Levenshtein or other). Can it be done in less than O(n^2)? Commented Dec 13, 2019 at 17:13
  • This problem could be solved efficiently using a metric tree. Commented Apr 20, 2022 at 21:14
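
As an aside, here is a minimal sketch of the metric-tree suggestion in the comment above, written as a BK-tree with a plain Levenshtein function injected as the metric (the class and helper names are my own, not from any particular library):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (any metric obeying the triangle inequality works here).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))       # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None                                    # (word, {dist: child})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def search(self, word, max_dist):
        found, stack = [], [self.root] if self.root else []
        while stack:
            node = stack.pop()
            d = self.distance(word, node[0])
            if d <= max_dist:
                found.append((d, node[0]))
            # The triangle inequality prunes children outside [d - max_dist, d + max_dist].
            for child_d, child in node[1].items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return found

tree = BKTree(levenshtein)
for w in ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]:
    tree.add(w)
print(tree.search("moose", 1))   # [(0, 'moose'), (1, 'mouse')] in some order

Finding all close pairs is then one query per string, which is usually far fewer comparisons than the full pairwise sweep, although the worst case still depends on the data.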

3 Answers


One way to do that (with complexity O(n k^2), where n is the number of strings and k is the length of the longest string) is to convert every string into a set of masks like this:

rat => ?at, r?t, ra?, ?rat, r?at, ra?t, rat?

This way, if two words differ in one letter, like 'rat' and 'cat', they will both have the mask ?at among others, while if one word is the other with one extra letter, like 'rat' and 'rats', they will both have the mask 'rat?'.

Then you just group strings based on their masks and print the groups that contain two or more strings. You might want to dedupe your array first if it has duplicates.

Here's some example code, with an extra "cats" string in it.

l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat", "cats"]
d = {}

def add(mask, s):
    if mask not in d:
        d[mask] = []
    d[mask].append(s)

for s in l:
    for pos in range(len(s)):
        add(s[:pos] + '?' + s[pos + 1:], s)   # the letter at pos replaced by '?'
        add(s[:pos] + '?' + s[pos:], s)       # '?' inserted before position pos
    add(s + '?', s)                           # '?' appended at the end

for k, v in d.items():
    if len(v) > 1:
        print(v)

Outputs

['moose', 'mouse']
['rat', 'cat']
['cat', 'cats']

3 Comments

This is great, thanks. I know I didn't specify this in the original question, but is there any way to detect swapped letters with this method? For instance, having "wasp" and "waps" get paired to each other. Either way, this might be what I end up using
Are you only looking at swapping adjacent letters, or should wasp and pasw also match? If you are only looking at adjacent letters, just populate a dictionary of all the strings with every pair of adjacent letters swapped, and then iterate over your strings again and query the dictionary. Or is it too slow?
Instead of that add function that deals with the case of a missing key, you could also use a defaultdict(list), and then simply call l = d[mask] and l.append multiple times.
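
A rough sketch of the adjacent-swap suggestion above, combined with the defaultdict tip (the keying scheme here is my own illustration, not part of the answer):

from collections import defaultdict

l = ["wasp", "waps", "moose", "mouse"]
d = defaultdict(list)

for s in l:
    d[s].append(s)                                  # the string itself
    for pos in range(len(s) - 1):                   # every adjacent pair swapped
        d[s[:pos] + s[pos + 1] + s[pos] + s[pos + 2:]].append(s)

pairs = set()
for group in d.values():
    words = sorted(set(group))
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            pairs.add((words[i], words[j]))

print(pairs)   # {('waps', 'wasp')} -- only swaps; the '?' masks handle the rest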

First, you must index your list with some fuzzy-search indexing.

Second, iterate over your list and find neighbors via quick lookups in the pre-built index.

About fuzzy indexing:

Approximately 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, and the modification is named the "Wilbur-Khovayko algorithm".

The basic idea: split strings into trigrams and search for the maximal intersection scores.

For example, take the string "hello world". It generates the trigrams hel, ell, llo, "lo ", "o_w", and so on; it also produces special prefix/suffix trigrams for each word, like $he, $wo, lo$, ld$.

Thereafter, an index is built for each trigram, recording the terms in which it is present.

So, there is a list of term_IDs for each trigram.

When the user submits a string, it is also split into trigrams; the program searches for the maximal intersection scores and generates an N-sized result list.

It works quickly: I remember that, on an old Sun/Solaris (256 MB RAM, 200 MHz CPU), it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.

You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
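
For readers who want to stay in Python, here is a rough sketch of the trigram-overlap idea described above; it is not the linked engine, and the padding and scoring details are only illustrative:

from collections import defaultdict

def trigrams(s):
    # Pad the string so prefix/suffix trigrams such as '$he' and 'lo$' are produced.
    padded = '$' + s + '$'
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def build_index(terms):
    index = defaultdict(set)                  # trigram -> set of term ids
    for term_id, term in enumerate(terms):
        for tri in trigrams(term):
            index[tri].add(term_id)
    return index

def closest(query, terms, index, n=3):
    scores = defaultdict(int)                 # term id -> number of shared trigrams
    for tri in trigrams(query):
        for term_id in index.get(tri, ()):
            scores[term_id] += 1
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [terms[i] for i in ranked[:n]]

terms = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
index = build_index(terms)
print(closest("moos", terms, index))          # ['moose', 'mouse']

Raw intersection counts favour longer strings, so a real implementation would likely normalise the score by the number of trigrams.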

2 Comments

Thanks for your help. I don't know any C, but I was able to find a Python module that does trigram matching. The module is called Fuzzyset and it was able to handle my dataset of ~13000 words in about 5 minutes. Not ideal, but acceptable. Thanks for pointing me in the right direction.
You're welcome, and thanks for appreciating my efforts. Also, if you need to increase the performance of your code, you can use RADIX50 encoding to convert each trigram to a 16-bit value. That value can be used directly as an index into a table containing the list of termIDs. This technique is used in my fuzzy engine. Details here: en.wikipedia.org/wiki/DEC_Radix-50
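
To illustrate the RADIX50 remark: with an alphabet of at most 40 symbols, a trigram packs into a value below 40^3 = 64000, which fits in 16 bits and can index a table directly. A toy version (with my own symbol table, not DEC's exact encoding) could look like this:

# Toy RADIX-50-style packing: at most 40 symbols, so a trigram fits in 16 bits.
SYMBOLS = " $abcdefghijklmnopqrstuvwxyz0123456789"        # 38 symbols used here
CODE = {c: i for i, c in enumerate(SYMBOLS)}

def pack_trigram(tri):
    a, b, c = (CODE[ch] for ch in tri)
    return (a * 40 + b) * 40 + c                           # always < 40**3 = 64000 < 2**16

print(pack_trigram("$he"))                                 # small integer, usable as a table index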

The naive implementation amounts to setting up a boolean matrix indexed by the strings (i.e. their positions in the sorted list) and comparing each pair of strings, setting the corresponding matrix element to true iff the strings are 'similar' with respect to your criterion. This runs in O(n^2).

You might be better off by transforming your strings into tuples of character frequencies ( e.g. 'moose' -> (0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,0,0,0,1,0,0,0,0,0,0,0) where the i-th vector component represents the i-th letter in the alphabet). Note that the frequency vectors will differ in 'few' components only ( e.g. for D-L distance 1 in at most 2 components, the respective differences being +1,-1 ).

Sort your transformed data. Candidates for the pairs you wish to generate will be adjacent or at least 'close' to each other in your sorted list of transformed values. You check the candidates by comparing each list entry with at most k of its successors in the list, k being a small integer (actually comparing the corresponding strings, of course). This algorithm will run in O(n log n).
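
Here is a rough sketch of this sort-then-scan idea; the sort key (plain lexicographic order on the frequency tuples), the window size k, and the direct distance-1 check are illustrative choices of mine, not prescriptions from the answer:

def freq_vector(word):
    # 26-component letter-frequency vector, as described above.
    fv = [0] * 26
    for ch in word:
        fv[ord(ch) - ord('a')] += 1
    return tuple(fv)

def within_one_edit(a, b):
    # Damerau-Levenshtein distance <= 1, checked directly.
    if a == b:
        return True
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:                                # one substitution
            return True
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:   # adjacent transposition
            i, j = diffs
            return a[i] == b[j] and a[j] == b[i]
        return False
    if abs(len(a) - len(b)) == 1:                          # one insertion/deletion
        shorter, longer = sorted((a, b), key=len)
        return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))
    return False

words = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
k = 3                                                      # how many successors to check
ordered = sorted(words, key=freq_vector)
pairs = [[w, v]
         for i, w in enumerate(ordered)
         for v in ordered[i + 1:i + 1 + k]
         if within_one_edit(w, v)]
print(pairs)                                               # [['mouse', 'moose'], ['rat', 'cat']]

With a small k, pairs whose vectors sort more than k positions apart are missed, so k trades speed against completeness.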

You have to trade off between the added overhead of transformation / sorting (with complex comparison operations depending on the representation you choose for the frequency vectors ) and the reduced number of comparisons. The method does not consider the intra-word position of characters but only their occurrence. Depending on the actual set of strings there'll be many candidate pairs that do not turn into actually 'similar' pairs.

As your data appears to consist of English lexemes, a potential optimisation would be to define character classes ( e.g. vowels/consonants, 'e'/other vowels/syllabic consonants/non-syllabic consonants ) instead of individual characters.
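
For instance, a coarser transformation along those lines might look like this (the particular classes are only an illustration):

VOWELS = set('aeiou')

def class_vector(word):
    # Count letters per coarse class instead of per individual letter.
    vowels = sum(ch in VOWELS for ch in word)
    return (vowels, len(word) - vowels)                    # (vowel count, consonant count)

print(class_vector('moose'), class_vector('mouse'))        # (3, 2) (3, 2)

Similar words then collide or sort together more often, at the cost of more false candidates to check.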

Additional optimisation:

Note that precisely those pairs of strings in your data set that are permutations of each other (e.g. [art, tar]) will produce identical values under the given transformation. So if you limit yourself to a D-L distance of 1, and if you do not count the transposition of adjacent characters as a single edit step, you never need to pick list items with identical transformation values as candidates.

8 Comments

Thank you a ton for your help, this is pretty much what I was looking for. This may be a silly question, but could you go into detail about the best way to sort the tuples of character frequencies?
You specify a comparison function (cmp) for your programming environment's generic sort routine. cmp may be the sum of absolute component differences. In general, you need some cmp that ignores the component identity.
What criteria should I be sorting the character frequencies by? Grouping them by number of as, then number of bs, then number of cs etc. seems arbitrary and uninformative, but nothing else springs to mind
Example (Python): use numpy.sign, numpy.sum, numpy.subtract to compare two feature vectors when sorting: fvlist.sort ( key=functools.cmp_to_key ( lambda fv1,fv2: return numpy.sign(numpy.sum(numpy.subtract(fv1, fv2))) ) );. Caveat: I'm not a python programmer so this codesnippet may be syntactically incorrect.
Thank you so much for your help, and sorry for all the questions. I'm not sure if this is going to work, though, because when I ran that code it sorted the example words as "rat, cat, lion, fish, moose, tiger, mouse". Rat and cat are next to each other, which is good, but moose and mouse are not. It should still help, though, because I guess I can compare each word to its four nearest neighbours, or something like that, instead of comparing each word to all other words
