Skip to main content

Analysis of Classic String Matching Algorithm (KMP)

Free2015-03-17#Math#串匹配算法#KMP

The problem background is to find the part in the original string that can completely match the given string. The easiest approach of double-loop comparison is of course not acceptable (time complexity is too high). The most classic solution is the KMP algorithm, which first constructs a jump table (next table), then compares, avoiding backtracking on the main string to save time. The KMP algorithm is difficult to understand; this article mainly analyzes its core next function.

no_mkd

I. Problem Restatement

Given a string S1, find the part in S1 that completely matches string S2, for example: S1 = "ababaababc", S2 = "ababc"

Then the matching result is 5 (the position of 'a' in "ababc" in S1). Of course, if there are multiple S2s in S1, it doesn't matter; if you can find the first one, you can find the second one...

The easiest method is naturally double-loop bit-by-bit comparison (BF algorithm), but in the worst case, the BF algorithm's time complexity reaches m * n, which is unacceptable in practical applications. So three people came up with the KMP algorithm (KMP is three people's names...)

II. Specific Process of KMP Algorithm

(Don't rush to look at the official definition of the KMP algorithm; it's very hard to understand, so we won't provide the definition here...)

Carefully examine S2 = "ababc". An important point is: 'ab' appears at the beginning, and 'ab' appears again in the middle (remember this detail for now)

Now assume a moment:

At this point, 'abab' partial matching is successful, S1's position pointer points to 'a', and what we're currently doing is comparing 'a' with 'c', which results in a match failure.

  • If it's the BF algorithm, then the next step is for S1's position pointer to backtrack to the second 'b', S2's position pointer resets to the beginning 'a', then compare 'b' with 'a', then...
  • If it's the KMP algorithm, then the next step is to keep S1's position pointer unchanged (pointing to 'a'), then check the jump table (next table) to get the jump value 2, then move S2's pointer to the second 'a', then compare 'a' with 'a', then...

It doesn't matter if you don't understand the KMP steps yet, after all, we haven't explained what the jump table (next table) is. But here we only need to focus on the result:

  • In the BF algorithm, S1's pointer moves left (backtracks)
  • In the KMP algorithm, S1's pointer doesn't go back (no backtracking)

Because the KMP algorithm has no backtracking process, it saves a lot of time (S1's pointer only needs to walk from beginning to end)

Let's look again at the core of KMP algorithm—the jump

Why can we jump? Carefully observe S2 = "ababc". The characteristic of this string is: 'ab' appears at the beginning, and 'ab' appears again in the middle (remember this detail again). Here's a detailed explanation:

If the 'c' at the end of S2 fails to match with the k-th position of S1, we can infer two pieces of information:

  1. This matching attempt has failed (c's failure means a certain part of S1 cannot match S2)
  2. The 4 positions before the k-th position in S1 must be 'abab' (only after 'abab' in S2 successfully matches a part of S1 can the comparison between S2's 'c' and S1's k-th position occur)

If we ignore point 2, then the next step is for S1's pointer to backtrack, which is what the BF algorithm will do; if we grasp point 2, plus the characteristic of S2 string:

  • 'ab' appears at the beginning, and 'ab' appears again in the middle (reiterate once more)

we can get the KMP algorithm (equivalent to saying that the 'ab' before the k-th position in S1 has already matched with the 'ab' at the beginning of S2, so we can directly jump to compare S1's k-th position with S2's 3rd position 'a')

Seems a bit clearer now, so how is this jump value 2 obtained? Let's look at more examples:

  • The jump value for the last position of S2 = "aba" is 0
  • The jump value for the last position of S2 = "abaa" is 1
  • The jump value for the last position of S2 = "abcabc" is 2

Have you discovered anything? Yes, the process of finding the jump value for the x-th position in S2 (here x starts from 0) is as follows:

  1. If x = 0, then the jump value of S2[x] is -1 (the first element's jump value is -1)
  2. If x = 1, then the jump value of S2[x] is 0 (the second element's jump value is 0)
  3. If S2[x - 1] = S2[0], then the jump value of S2[x] is 1
  4. If condition 3 is not satisfied, then the jump value is 0
  5. If S2[x - 1] = S2[0] and S2[x - 2] = S2[1], then the jump value of S2[x] is 2
  6. If S2[x - 1] = S2[0] and S2[x - 2] = S2[1] and S2[x - 3] = S2[2], then the jump value of S2[x] is 3
  7. ...

For humans, using eyes to "visually estimate" the jump value according to the method above is the fastest, but the same process is not so easy to implement for computers. Computers have their preferred way. A short blog post explains this approach.

Simply put—it's "recursion", that is, starting from the known first term -1, second term 0, recursively derive all subsequent terms. The detailed process won't be elaborated here; the linked blog post above explains it very clearly.

Now we can derive the specific process of the KMP algorithm:

  1. Construct the jump table (next table) according to the pattern string S2
  2. Start comparing from the beginning of S1, check the next table to get the jump value, move S1's pointer to the right and continue comparing, until the end of S1

To put it plainly, it's trading space for time again (the space occupied by the next table). Of course, in this algorithm, the next table is just a linear table with length equal to the length of pattern string S2, so it doesn't require much space.

III. Implementing the next Function

The next function is used to construct the next table (jump table). How to construct the next table is the key to the KMP algorithm (if you don't care about the proof process of KMP...). We can implement the next function ourselves according to the linked blog post:

//Reference example: http://blog.sina.com.cn/s/blog_96ea9c6f01016l6r.html

#include<stdio.h>

void getNext(char a[], int n, int next[]){ int i, j;

next[0] = -1;//First element jump value is -1
next[1] = 0;//Second element jump value is 0

for(i = 2; i &lt; n; i++){
	j = i - 1;
	
	//Recursively derive remaining values in next table
	while(j != -1){
		if(a[i - 1] == a[next[j]]){
			next[i] = next[j] + 1;
			break;
		}
		else{
			j = next[j];
		}
	}
}

}

main(){ char a[] = "abaabcac";//Pattern string S2 int next[8] = {0};//Jump table, initialized to all 0 int i;

//Construct next table
getNext(a, 8, next);
//Output next table
for(i = 0; i &lt; 8; i++){
	printf("%d ", next[i]);
}
printf("\n");

}

Of course, the getNext function above still doesn't seem efficient enough (double loop...), but the advantage is it's easy to understand. Now let's look at the next function given in the textbook:

#include<stdio.h>

void getNext(char a[], int n, int next[]){ int i, j; i = 0; next[0] = -1;//First element jump value is -1 j = -1;

//Recursively derive remaining values in next table
while(i &lt; n){
	if(j == -1 || a[i] == a[j]){
		++i;
		++j;
		next[i] = j;
	}
	else{
		j = next[j];
	}
}

}

main(){ char a[] = "abaabcac";//Pattern string S2 int next[8] = {0};//Jump table, initialized to all 0 int i;

//Construct next table
getNext(a, 8, next);
//Output next table
for(i = 0; i &lt; 8; i++){
	printf("%d ", next[i]);
}
printf("\n");

}

Both produce the same calculation results, but the algorithm given in the textbook eliminates the double loop. However, the result of doing this is just making the code more complex, without any substantial optimization.

What? Eliminating the double loop doesn't improve efficiency? That can't be right, can it? Let's verify with a counter:

void getNext(char a[], int n, int next[]){
	int i, j;
	int counter = 0;///
next[0] = -1;//First element jump value is -1
next[1] = 0;//Second element jump value is 0

for(i = 2; i &lt; n; i++){
	j = i - 1;
	
	//Recursively derive remaining values in next table
	while(j != -1){
		if(a[i - 1] == a[next[j]]){
			next[i] = next[j] + 1;
			counter++;///
			break;
		}
		else{
			j = next[j];
			counter++;///
		}
	}
}

printf("\n%d\n", counter);///

}

void getNext(char a[], int n, int next[]){ int i, j; int counter = 0;/// i = 0; next[0] = -1;//First element jump value is -1 j = -1;

//Recursively derive remaining values in next table
while(i &lt; n){
	if(j == -1 || a[i] == a[j]){
		++i;
		++j;
		next[i] = j;
		counter++;///
	}
	else{
		j = next[j];
		counter++;///
	}
}

printf("\n%d\n", counter);

}

P.S. We inserted counter++; in the specific operation parts (the innermost if block and else block), so the results obtained are comparable (how many specific operations the algorithm performed)

The running results are as follows:

The left figure shows the counting result of our implemented next function. The 4 fewer steps are the difference in outer loop iterations (the textbook's algorithm is n times, our algorithm is n-2 times). If our algorithm's outer loop iterations were also n, it would also require 14 specific operations (we just made a simple optimization)

The two next functions are just different in form; their internal operation sequences are completely identical. Once you understand the next function, there's no difficult point in the KMP algorithm (if you don't care about why S1's pointer doesn't need to backtrack, indeed there's only this one difficult point...)

IV. Proof of KMP Algorithm Correctness

(This article won't expand on this here; if understood later, the missing content may be supplemented here, explaining in detail why S1's pointer doesn't need to backtrack, why the previous part cannot match again...)

But if purely focusing on "implementation", as long as we understand the next function, we can completely and easily implement the KMP algorithm (as for why this can be done, how to prove this is correct... that's the mathematicians' business)

V. Summary

The core of the KMP algorithm is the construction process of the next table, and the key idea for constructing the next table is "recursion". Understanding this, you can completely write the KMP algorithm in minutes...

A side note:

The KMP algorithm also has variants. This article discusses the most original KMP algorithm. Common variants include:

  • The first element of the next table is 0 (instead of -1). The final result is adding 1 to each element in our next table. It's just a different convention (starting from 0 vs starting from 1), with no substantial difference
  • There are multiple -1s in the next table (our result has one and only one -1, which is the first element of the next table). This is a substantial optimization that can effectively improve efficiency. Actually, it's just doing one more step than us when constructing the next table. The construction process becomes a bit more complex, but the number of comparisons in the matching algorithm is reduced

Comments

No comments yet. Be the first to share your thoughts.

Leave a comment