The Knuth–Morris–Pratt string search algorithm is described in the paper Fast Pattern Matching in Strings (SIAM J. Computing vol. 6 no. 2, June 1977). The initial step of the algorithm is to compute the next table, defined as follows:
The pattern-matching process will run efficiently if we have an auxiliary table that tells us exactly how far to slide the pattern, when we detect a mismatch at its
jth characterpattern[j]. Letnext[j]be the character position in the pattern which should be checked next after such a mismatch, so that we are sliding the patternj − next[j]places relative to the text.
The authors give the example of the pattern abcabcacab. If there is a mismatch at j=7:
abcabcacab
abcabca?
Then the pattern should be moved 3 places to the right and matching should continue with the 4th character of the pattern:
abcabcacab
abcabca?
so next[7] = 4. In some cases we know we can skip the mismatched character entirely, for example if there is a mismatch at j=3:
abcabcacab
abc?
then the search should continue from the character after the mismatch:
abcabcacab
abc?
These special cases are indicated by next[j] = −1.
(If you're reading the paper, note that the authors use indexes starting at 1 as in Fortran, but the Python convention is to use indexes starting at 0, so that's what I'm giving here.)
This is the code that computes the next table. Please review.
def findPattern(pattern):
j = -1
next = [-1] * len(pattern)
i = 0 # next[0] is always -1, by KMP definition
while (i+1 < len(pattern)):
if (j == -1) or (pattern[j] == pattern[i]):
i += 1
j += 1
if pattern[i] != pattern[j]:
next[i] = j
else:
next[i] = next[j]
else:
j = next[j]
return next
if __name__ == "__main__":
print findPattern("aaaab")
print findPattern("abaabc")
Output:
[-1, -1, -1, -1, 3]
[-1, 0, -1, 1, 0, 2]
ABCDABCDbecomes the table [-1, 0, 0, 0, 0, 1, 2, 3], butfindPattern("ABCDABCD")returns [-1, 0, 0, 0, -1, 0, 0, 0]. So either there's a bug in your code, or you are implementing some other table-building function and need to explain in more detail. \$\endgroup\$Ttable described in the Wikipedia article is the same as theftable in KMP — but theftable is just a step in the actual construction of thenexttable, which is what the KMP algorithm actually uses. So ignore what I said about failing to match the Wikipedia algorithm. \$\endgroup\$