Knuth–Morris–Pratt (KMP) algorithm

Feroz Ahmad bio photo By Feroz Ahmad Comment
KMP is a pattern matching algorithm which has very good running time. Firstly, what is the naive approach solution for pattern matching ? Answer to this question will lead to the key idea behind the KMP algo. How would you go about matching a pattern with a text ? You might search  the first character of the pattern in the text, and after finding that you will try to match the other sequentially appearing characters in  the pattern till you match all or find a mismatch. Bravo, if you find a match but what if you didn't. You will go back to the point where your first character matched. And start the same procedure from the character next to the first match. This can be seen in the following example : Patterrn Matching Naive There is an annoying step which we perform if we find a mismatch, that is going BACK to the first character of match. The KMP algo improves the worst time complexity significantly just by eliminating this step of going backwards. Obviously, no one likes to step backwards :P Key Concept of KMP algo : The KMP matching algorithm uses degenerating property - pattern having same subpatterns appearing more than once in the pattern. The algorithm first creates an integer array of size same as that of the pattern and stores some useful information after computing on the text, which it then uses while matching the pattern with the text. KMP algorithm does some preprocessing over the pattern pat[] and constructs an auxiliary array lps[] of size same as size of pattern (=m). lps = longest proper prefix which is also suffix. For each sub-pattern pat[0…i] where i = 0 to m-1, lps[i] stores length of the maximum matching proper prefix which is also a suffix of the sub-pattern pat[0..i]. For Example : Let Pattern be FFZFFZFFF : for i=0 : "F" - Proper prefix = "" and Proper Suffix = ""                                        lps[0]=0 for i=1 : "FF" - Proper prefix = "F" and Proper Suffix = "F" Longest common prefix which is also a suffix = "F" ie length  1, thus lps[1]=1 for i=2: "FFZ" - Proper prefix = "F","FF" and Proper Suffix = "Z","FZ"                                      thus lps[2]=0 for i=3: "FFZF" - Proper prefix = "F","FF","FFZ" and Proper Suffix = "F","ZF", "FZF"                                      thus lps[3]=1 for i=4: "FFZFF" - Proper prefix = "F","FF","FFZ","FFZF" and Proper Suffix = "F","FF"ZFF", "FZFF"                                      thus lps[3]=2 (as longest common is "FF") for i=5: "FFZFFZ" - Proper prefix = "F","FF","FFZ","FFZF","FFZFF" and Proper Suffix = "Z","FZ"FFZ", "ZFFZ","FZFFZ"                                      thus lps[4]=3 (as longest common is "FFZ") Similarly lps[i] can be carried out for  whole pattern. Now this lps[] array is used while matching patterns.
Pseudocode Code for Function to create lps array
Function    create_lps( pat,  size) :
    lps[0]=0
    INITIALIZE i=1 and j=0
    WHILE i < PATTERN.LENGTH:
        if PATTERN at i == PATTERN at j :
            lps[i]=j+1
            i++ and j++ //MOVE i and j ahead
        else //MISMATCH
            if j > 0:
                j=lps[j-1];
            else //when j stands at 0
                //NO PART OF PATTERN till i, HAS LPS
                lps[i]=0
                i++
Pseudocode Code for Function to check pattern
Function KMP_chk():
	WHILE i < TEXT.LENGTH :
		if TEXT at i == PATTERN at j :
			i++ and j++
		else
			if mismatch occurs at j>0
				j=lps[j-1];		//THIS IS THE STEP WHICH AVOIDS GOING BACK TO
			else
                                i++;
		if j== PATTERN.LENGTH	//THAT MEANS TEXT at i == PATTERN at j executed j times only because it matches the pattern
			//PATTERN MATCH FOUND, STORE AND FIND MORE OCCURANCES.
Here is the C++ implementation code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
#include <iostream>
using namespace std;
int lens(char *s)
{
    int i;
    for(i=0;s[i]!=0;i++);
    return i;
}
void createlps(char *pat, int *lps, int size)
{
    int i,j;
    lps[0]=0;
    i=1;j=0;
    while(i<size)
    {
        if(pat[i]==pat[j])
        {
            lps[i]=j+1;
            i++;j++;
        }
        else
        {
            if(j>0)
            	j=lps[j-1];
            else
            {
	        lps[i]=0;
                i++;
            }
        }
    }
}
void checkoccurances(char *text, char *pat, int *lps, int sizep, int *occurances)
{
    int i,j,k;
    i=0;k=0;j=0;
    while(text[i]!=0)
    {
        if(pat[j]==text[i])
        {
            i++;
            j++;
        }
        else
        {
                if(j>0)
                    j=lps[i-1];
                else
                    i++;
        }
        if(j==sizep)
        {
            occurances[k++]=i-sizep;
            j=0;
        }
    }
    occurances[k]=-1;
}
int main() {
    char text[5000],pat[50];
    int i,occurances[50];
    cout<<"Enter the text : ";
    cin>>text;
    cout<<"Enter the pattern : ";
    cin>>pat;
    int patlen=lens(pat);
    int *lps= new int [patlen];
    createlps(pat,lps,patlen);
    checkoccurances(text,pat,lps,patlen,occurances);
	cout<<"\n \nOccurances are at indices :";
    for(i=0;occurances[i]!=-1;i++)
    	cout<<occurances[i]<<"   ";
cin.ignore();
    cin.get();
    return 0;
}
Note the similarity in the code of createlps() and checkoccurances()
comments powered by Disqus