I know about approximate string searching and things like the Levenshtein distance, but what I want to do is take a large list of strings and quickly pick out any matching pairs that are similar to each other (say, 1 Damerau-Levenshtein distance apart). So something like this
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
matching_strings(l)
# Output
# [["moose","mouse"],["rat", "cat"]]
I only really know how to use R and Python, so bonus points if your solution can be easily implemented in one of those languages.
UPDATE:
Thanks to Collapsar's help, here is a solution in Python
import numpy
import functools
alphabet = {'a': 0, 'c': 2, 'b': 1, 'e': 4, 'd': 3, 'g': 6, 'f': 5, 'i': 8, 'h': 7, 'k': 10, 'j': 9, 'm': 12, 'l': 11, 'o': 14, 'n': 13, 'q': 16, 'p': 15, 's': 18, 'r': 17, 'u': 20, 't': 19, 'w': 22, 'v': 21, 'y': 24, 'x': 23, 'z': 25}
l = ["moose", "tiger", "lion", "mouse", "rat", "fish", "cat"]
fvlist=[]
for string in l:
fv = [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
for letter in string:
fv[alphabet[letter]]+=1
fvlist.append(fv)
fvlist.sort (key=functools.cmp_to_key(lambda fv1,fv2: numpy.sign(numpy.sum(numpy.subtract(fv1, fv2)))))
However, the sorted vectors are returned in the following order:
"rat" "cat" "lion" "fish" "moose" "tiger" "mouse"
Which I would consider to be sub-optimal because I would want moose and mouse to end up next to each other. I understand that however I sort these words there's no way to get all of the words next to all of their closest pairs. However, I am still open to alternative solutions