Knuth–Morris–Pratt string match algorithm

Question

The Knuth–Morris–Pratt string search algorithm is described in the paper Fast Pattern Matching in Strings (SIAM J. Computing vol. 6 no. 2, June 1977). The initial step of the algorithm is to compute the next table, defined as follows:

The pattern-matching process will run efficiently if we have an auxiliary table that tells us exactly how far to slide the pattern, when we detect a mismatch at its jth character pattern[j]. Let next[j] be the character position in the pattern which should be checked next after such a mismatch, so that we are sliding the pattern j − next[j] places relative to the text.

The authors give the example of the pattern abcabcacab. If there is a mismatch at j=7:

abcabcacab
abcabca?

Then the pattern should be moved 3 places to the right and matching should continue at index 4 of the pattern (its fifth character):

   abcabcacab
abcabca?

so next[7] = 4. In some cases we know we can skip the mismatched character entirely, for example if there is a mismatch at j=3:

abcabcacab
abc?

then the search should continue from the character after the mismatch:

    abcabcacab
abc?

These special cases are indicated by next[j] = −1.

(If you're reading the paper, note that the authors use indexes starting at 1 as in Fortran, but the Python convention is to use indexes starting at 0, so that's what I'm giving here.)
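
For illustration only (this is not the code under review), here is a minimal sketch of how a search loop could consume such a table, assuming the 0-based convention described above. The name kmp_search and the hard-coded table are my own choices for the example; the table values are the paper's example for abcabcacab shifted to 0-based indexes.

```python
def kmp_search(text, pattern, table):
    """Return the index of the first occurrence of pattern in text, or -1.

    table is assumed to be the 0-based "next" table for pattern.
    """
    m = len(pattern)
    if m == 0:
        return 0  # convention chosen here: the empty pattern matches at 0
    j = 0  # index of the pattern character to compare next
    for k, ch in enumerate(text):
        # On a mismatch at pattern[j], resume at pattern[table[j]].
        # Reaching j == -1 means this text character is skipped entirely.
        while j > -1 and pattern[j] != ch:
            j = table[j]
        j += 1
        if j == m:
            return k - m + 1
    return -1


# "next" table for the paper's example pattern, shifted to 0-based indexes.
table = [-1, 0, 0, -1, 0, 0, -1, 4, -1, 0]
text = "babcbabcabcaabcabcabcacabc"
print(kmp_search(text, "abcabcacab", table))  # 15
```

The inner while loop is where the table pays off: the text index k never moves backwards, only the pattern index j does.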

This is the code that computes the next table. Please review.

def findPattern(pattern):

    j = -1
    next = [-1] * len(pattern)
    i = 0 # next[0] is always -1, by KMP definition

    while (i+1 < len(pattern)):
        if (j == -1) or (pattern[j] == pattern[i]):
            i += 1
            j += 1
            if pattern[i] != pattern[j]:
                next[i] = j
            else:
                next[i] = next[j]
        else:
            j = next[j]

    return next

if __name__ == "__main__":

    print(findPattern("aaaab"))
    print(findPattern("abaabc"))

Output:

[-1, -1, -1, -1, 3]
[-1, 0, -1, 1, 0, 2]

What is the expected output? What kind of patterns are you expecting to find? — holroy
– holroy, Commented Oct 18, 2015 at 6:14
Could you please write with words, what that output means? It's still a little unclear, and that makes it harder to provide a good review. — holroy
– holroy, Commented Oct 18, 2015 at 7:00
The algorithm linked is for detecting strings, but you said you're using it to create patterns? It's very hard to follow what your code is for in its current state. — SuperBiasedMan
– SuperBiasedMan, Commented Oct 18, 2015 at 11:28
I think that this is supposed to be the "table-building" part of the Knuth–Morris–Pratt algorithm. However, it doesn't build the same table as the algorithm given in Wikipedia, where it says the word ABCDABCD becomes the table [-1, 0, 0, 0, 0, 1, 2, 3], but findPattern("ABCDABCD") returns [-1, 0, 0, 0, -1, 0, 0, 0]. So either there's a bug in your code, or you are implementing some other table-building function and need to explain in more detail. — Gareth Rees
– Gareth Rees, Commented Oct 18, 2015 at 12:27
I have been reading the original Knuth–Morris–Pratt paper, from which I have learned that the Wikipedia article is seriously misleading — the algorithm it describes is not the same as the one in the KMP paper. The T table described in the Wikipedia article is the same as the f table in KMP — but the f table is just a step in the actual construction of the next table, which is what the KMP algorithm actually uses. So ignore what I said about failing to match the Wikipedia algorithm. — Gareth Rees
– Gareth Rees, Commented Oct 20, 2015 at 20:04
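
A small sketch of the distinction drawn in the last comment, assuming 0-based indexes with -1 as the sentinel. The names failure_table and next_from_failure are made up for this illustration: the first builds the f table (the one the Wikipedia article calls T), the second derives the next table from it.

```python
def failure_table(pattern):
    """f[i]: length of the longest proper border of pattern[:i]; f[0] = -1."""
    f = [-1] * len(pattern)
    j = -1
    for i in range(1, len(pattern)):
        while j > -1 and pattern[i - 1] != pattern[j]:
            j = f[j]
        j += 1
        f[i] = j
    return f


def next_from_failure(pattern, f):
    """Derive the "next" table: skip fallbacks that would repeat the mismatch."""
    nxt = [-1] * len(pattern)
    for i in range(1, len(pattern)):
        if pattern[i] != pattern[f[i]]:
            nxt[i] = f[i]
        else:
            # Same character at the fallback position, so it would fail too.
            nxt[i] = nxt[f[i]]
    return nxt


f = failure_table("ABCDABCD")
print(f)                                 # [-1, 0, 0, 0, 0, 1, 2, 3]
print(next_from_failure("ABCDABCD", f))  # [-1, 0, 0, 0, -1, 0, 0, 0]
```

The two printed tables are the two results quoted in the comments above, which suggests the posted code computes next rather than f.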

Gareth Rees · Accepted Answer · 2015-10-20 22:06:13Z

1. Review

There's no docstring.
There's no need for parentheses around conditions (Python is not C), so instead of:
```
while (i+1 < len(pattern)):
```
you can write:
```
while i+1 < len(pattern):
```
The loop while i+1 < len(pattern) calls the len function on each iteration, even though pattern has not changed. You could avoid this wasted call by caching len(pattern) in a local variable.
The or operator has lower precedence than comparison operators, so instead of:
```
if (j == -1) or (pattern[j] == pattern[i]):
```
you can omit the parentheses:
```
if j == -1 or pattern[j] == pattern[i]:
```
When there's a choice about whether to test for equality or inequality, then I think it's usually clearer to test for equality, so I would write if pattern[i] == pattern[j] instead of if pattern[i] != pattern[j].
There's a small inefficiency in your code. If the test j == -1 or pattern[j] == pattern[i] fails then you set j = next[j] and go round the while loop again. But the condition on the while loop is a condition on i, which has not changed, so you waste the test. It is better to go straight to the test on j, like this:
```
m = len(pattern)
while i + 1 < m:
    while j > -1 and pattern[i] != pattern[j]:
        j = next[j]
    i += 1
    j += 1
    if pattern[i] == pattern[j]:
        next[i] = next[j]
    else:
        next[i] = j
```
After making this change, i always increases by 1 on each iteration of the main loop, so we could use a for loop instead to make this clear.

2. Revised code

def kmp_table(pattern):
    """Compute the "next" table corresponding to pattern, for use in the
    Knuth-Morris-Pratt string search algorithm.

    """
    m = len(pattern)
    next = [-1] * m
    j = -1
    for i in range(1, m):
        while j > -1 and pattern[i-1] != pattern[j]:
            j = next[j]
        j += 1
        if pattern[i] != pattern[j]:
            next[i] = j
        else:
            next[i] = next[j]
    return next
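
One way to gain some confidence in the table (a sketch, not a proof) is to compare kmp_table against a slow, direct transcription of the definition of next on many short patterns. The helper next_by_definition, the alphabet and the length limit are assumptions of this example, and the definition is a 0-based reading of the one in the paper.

```python
from itertools import product


def kmp_table(pattern):
    """The revised function from above, repeated so this runs on its own."""
    m = len(pattern)
    next = [-1] * m
    j = -1
    for i in range(1, m):
        while j > -1 and pattern[i-1] != pattern[j]:
            j = next[j]
        j += 1
        if pattern[i] != pattern[j]:
            next[i] = j
        else:
            next[i] = next[j]
    return next


def next_by_definition(pattern):
    """next[j] is the largest i < j such that pattern[:i] is a suffix of
    pattern[:j] and pattern[i] != pattern[j], or -1 if there is no such i.
    """
    table = []
    for j in range(len(pattern)):
        candidates = [i for i in range(j)
                      if pattern[:i] == pattern[j-i:j]
                      and pattern[i] != pattern[j]]
        table.append(max(candidates, default=-1))
    return table


# Exhaustively check every pattern over "abc" up to length 8.
for length in range(9):
    for letters in product("abc", repeat=length):
        pattern = "".join(letters)
        assert kmp_table(pattern) == next_by_definition(pattern), pattern

print(kmp_table("abcabcacab"))  # [-1, 0, 0, -1, 0, 0, -1, 4, -1, 0]
```

The exhaustive loop only covers short patterns over a three-letter alphabet, so it is evidence rather than a guarantee.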

Hi Gareth, accepted for your reply and appreciate for the learning. Wondering if any other functional bugs in your mind? For functional I mean the next[] is not generated correctly. :) — Lin Ma
– Lin Ma, Commented Oct 20, 2015 at 22:17
The logic looks identical to that given in KMP, and it gives the same results as the examples in the paper. I haven't checked it beyond that. — Gareth Rees
– Gareth Rees, Commented Oct 20, 2015 at 22:30
Thanks Gareth, as long as you did not find any bugs, I am confident. :) — Lin Ma
– Lin Ma, Commented Oct 20, 2015 at 22:40
Thanks for all the help Gareth, mark your reply as an answer. Have a good weekend. :) — Lin Ma
– Lin Ma, Commented Oct 25, 2015 at 2:14

Caridorc · Answer · 2015-10-20 22:06:00Z

Efficiency time-complexity bug

while (i+1 < len(pattern)):

len(pattern) is evaluated on every iteration even though it never changes; this makes your running time n times slower, where n is len(pattern).

Use a variable to fix the bug:

pattern_length = len(pattern)

And:

while (i + 1 < pattern_length):

Thanks Caridorc, accepted for your reply and appreciate for the learning. Wondering if any other functional bugs in your mind? For functional I mean the next[] is not generated correctly. :)
– Lin Ma, Commented Oct 20, 2015 at 22:17
len is a constant time operation on strings and other builtin types, because each object stores its length.
– Janne Karila, Commented Oct 21, 2015 at 9:37
@JanneKarila Good I will delete this answer
– Caridorc, Commented Oct 21, 2015 at 12:55
