Edit Distance in Python

Question

I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.

I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.

I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.

You might want to look at Peter Norvig's spell checker (norvig.com/spell-correct.html) and modify it to suit your needs. — Richard Nienaber
– Richard Nienaber, Commented Mar 17, 2010 at 6:08
Figure 2.15 in this book shows pseudo-code for Levenshtein distance. The NLTK Python package has a function for computing this. — Renel Chesak
– Renel Chesak, Commented Jun 28, 2018 at 14:00

0x90 · Accepted Answer · 2020-11-26 00:29:22Z

82

The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

And a couple of more implementations are here.

edited Nov 26, 2020 at 0:29

0x90

41.4k41 gold badges179 silver badges263 bronze badges

answered Sep 14, 2015 at 6:52

Salvador Dali

224k151 gold badges725 silver badges766 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

alancalvitti Over a year ago

@Salvador Dali Shouldn't adjacent transposition return a distance of 1? The above function gives levenshteinDistance("abc","bac") --> 2

Simd Over a year ago

@alancalvitti The only operations are insert, delete and substitute. So in you example you need to delete the a and then reinsert it between the b and c. Two operations.

Outcast Over a year ago

Hello :), do you have by chance a version of it (or of another algorithm) where you can specify an upper bound at the levenshtein distance? (You may check also my post for more details: stackoverflow.com/questions/59686989/…)

ryanjdillon · Accepted Answer · 2021-08-23 09:03:40Z

37

difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.

From the docs

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

edited Aug 23, 2021 at 9:03

answered Apr 29, 2019 at 15:02

ryanjdillon

19.2k10 gold badges88 silver badges113 bronze badges

3 Comments

MERose Over a year ago

...except that this is not the edit/Levenshtein distance.

ryanjdillon Over a year ago

@MERose My answer is aimed at giving a solution to the problem that the question presents for other looking to solve the same problem-- and not to answer it per what appears to be a homework assignment's specifications.

dantiston Over a year ago

This is amazing!

firelynx · Accepted Answer · 2021-10-12 19:02:08Z

9

I would recommend not creating this kind of code on your own. There are libraries for that.

For instance the Levenshtein library.


In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3

In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6

In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3

In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3

In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11

In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

answered Oct 12, 2021 at 19:02

firelynx

32.5k10 gold badges94 silver badges104 bronze badges

Comments

Santosh · Accepted Answer · 2014-06-11 20:56:01Z

8

Here is my version for Levenshtein distance

def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

print(edit_distance("Helloworld", "HalloWorld"))

answered Jun 11, 2014 at 20:56

Santosh

3285 silver badges9 bronze badges

1 Comment

Artur Krajewski Over a year ago

Your code is incorrect. Logic is unsound and return tbl[i,j] should be return tbl[n,m].

frazman · Accepted Answer · 2015-09-15 05:14:51Z

7

#this calculates edit distance not levenstein edit distance
word1="rice"

word2="ice"

len_1=len(word1)

len_2=len(word2)

x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance

for i in range(0,len_1+1): #initialization of base case values

    x[i][0]=i
for j in range(0,len_2+1):

    x[0][j]=j
for i in range (1,len_1+1):

    for j in range(1,len_2+1):

        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1] 

        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1

print x[i][j]

edited Sep 15, 2015 at 5:14

frazman

33.5k82 gold badges194 silver badges283 bronze badges

answered Nov 12, 2013 at 12:16

ishaan arora

5418 silver badges19 bronze badges

Comments

krassowski · Accepted Answer · 2021-04-21 18:51:21Z

6

Using the SequenceMatcher from Python built-in difflib is another way of doing it, but (as correctly pointed out in the comments), the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).

from difflib import SequenceMatcher

a = 'kitten'
b = 'sitting'

required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
#    # (tag, i1, i2, j1, j2)
#    ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
#    ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
#    ('insert', 6, 6, 6, 7),  # insert b[6:7]="g" after a[6:6]="n"
# ]


# the edit distance:
len(required_edits)  # == 3

edited Apr 21, 2021 at 18:51

answered Jun 1, 2020 at 16:03

krassowski

15.7k4 gold badges75 silver badges114 bronze badges

1 Comment

Russ Over a year ago

Running this for '123' and '12345' produces a required_edits of length 1, while the edit distance is 2.

J.Michael · Accepted Answer · 2021-08-27 07:51:19Z

1

Similar to Santoshi's solution above but I made three changes:

One line initialization instead of five
No need to define cost alone (just use int(boolean) 0 or 1)
Instead of double for loop use product, (this last one is only cosmetic, double loop seems unavoidable)

from itertools import product

def edit_distance(s1,s2):      
   d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
   for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)): 
       d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
   return d[i,j]

edited Aug 27, 2021 at 7:51

answered Nov 12, 2020 at 14:16

J.Michael

517 bronze badges

2 Comments

Glenn Slayden Over a year ago

You are referencing a variable s (i.e., s[i-1]!=s[j-1]) which is not defined in your code.

J.Michael Over a year ago

Thanks made the correction! Also noted that in Python True + n = n+1, and False + n =n so no need to explicitly convert to type 'int'

pacholik · Accepted Answer · 2017-04-01 13:13:31Z

0

Instead of going with Levenshtein distance algo use BK tree or TRIE, as these algorithms have less complexity then edit distance. A good browse over these topic will give a detailed description.

This link will help you more about spell checking.

edited Apr 1, 2017 at 13:13

pacholik

9,0529 gold badges48 silver badges59 bronze badges

answered Apr 1, 2017 at 12:54

jt26

314 bronze badges

Comments

Inaam Ilahi · Accepted Answer · 2019-01-15 14:23:56Z

You need Minimum Edit Distance for this task.

Following is my version of MED a.k.a Levenshtein Distance.

def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)

    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1

    accumulator = [[0 for x in range(len2)] for y in range(len1)] #initializing a zero matrix

    # initializing the base cases
    for i in range(0,len1):
        accumulator[i][0] = i;
    for i in range(0,len2):
        accumulator[0][i] = i;

    # we take the accumulator and iterate through it row by row. 
    for i in range(1,len1):
        char1=str1[i]
        for j in range(1,len2):
            char2=str2[j]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )

    cost=accumulator[len1-1][len2-1]
    return cost

Zac S.Y · Accepted Answer · 2022-04-25 05:17:37Z

0

Fine tuned codes based on the version from @Santosh and should address the issue brought up by @Artur Krajewski; The biggest difference is replacing an effective 2d matrix


def edit_distance(s1, s2):
# add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
# launch a matrix
    tbl = [[0] * n for i in range(m)] 
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j

    for i in range(1, m):
        for j in range(1, n):
#if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
#find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    return tbl

edit_distance("birthday", "Birthdayyy")

edited Apr 25, 2022 at 5:17

answered Apr 25, 2022 at 4:39

Zac S.Y

11 bronze badge

Comments

Harry Moreno · Accepted Answer · 2022-06-08 16:33:37Z

0

following up on @krassowski's answer

from difflib import SequenceMatcher

def sequence_matcher_edits(word_a, word_b):
  required_edits = [code for code in (
      SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
    )
    if code[0] != 'equal'
  ]
  return len(required_edits)

print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3

answered Jun 8, 2022 at 16:33

Harry Moreno

11.9k8 gold badges75 silver badges120 bronze badges

Comments

SelfTaughtProgrammer · Accepted Answer · 2025-03-19 21:35:29Z

0

nltk.edit_distance() is the first reputable implementation I could find of Levenshtein edit-distance.

python
from nltk import edit_distance

# Prints '2', one deletion plus one insertion
print(edit_distance('apple','appel'))

Mostly just sharing this because it's what I was looking for when I found this page.

answered Mar 19 at 21:35

SelfTaughtProgrammer

72 bronze badges

Collectives™ on Stack Overflow

Edit Distance in Python

12 Answers 12

3 Comments

3 Comments

Comments

1 Comment

Comments

1 Comment

2 Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

12 Answers 12

3 Comments

3 Comments

Comments

1 Comment

Comments

1 Comment

2 Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related