Unit 5 DS
Pattern Matching and Tries: Pattern matching algorithms - Brute force, the Boyer-Moore algorithm, the Knuth-Morris-Pratt algorithm; Standard Tries, Compressed Tries, Suffix Tries.
Pattern Matching
Pattern searching is an important problem in computer science. When we search for a string in a notepad/word file, a browser, or a database, pattern searching algorithms are used to show the search results.
A typical problem statement would be:
Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char
txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n > m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
Input: txt[] = "AABAACAADAABAABA"
pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12
Different Types of Pattern Matching Algorithms
1. Naive Algorithm (Brute Force)
2. Boyer Moore Algorithm
3. Knuth-Morris Pratt (KMP) Algorithm
Naive Algorithm (Brute Force)
When we talk about string matching, a simple technique comes to mind: starting from the first letter of the text and the first letter of the pattern, check whether these two letters are equal. If they are, check the second letters of the text and pattern. If they are not, move the first letter of the pattern to the second letter of the text and check again. This is the simple technique everyone thinks of first.
The brute-force string matching algorithm works in exactly this way, which is why it is also called the naive string matching algorithm (naive means basic).
do
if (text letter == pattern letter)
compare next letter of pattern to next letter of text
else
move pattern down text by one letter
while (entire pattern found or end of text)
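The pseudocode above can be turned into a runnable function. This is a minimal Python sketch; the name `naive_search` and the list-of-indices return value are choices made here (the problem statement earlier prints its matches instead):

```python
def naive_search(pat, txt):
    """Slide pat over txt one position at a time, as in the pseudocode."""
    n, m = len(txt), len(pat)
    hits = []
    for i in range(n - m + 1):          # each candidate alignment of pat
        j = 0
        while j < m and txt[i + j] == pat[j]:
            j += 1                      # letters match: compare the next pair
        if j == m:                      # the entire pattern matched
            hits.append(i)
    return hits

print(naive_search("AABA", "AABAACAADAABAABA"))   # [0, 9, 12]
```

This reproduces the example output from the problem statement: matches at indices 0, 9 and 12.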
In the figures above, red boxes mark letters of the pattern that mismatch against letters of the text, and green boxes mark letters that match. According to the figures:
In the first row we check whether the first letter of the pattern matches the first letter of the text. It is a mismatch, because "S" is the first letter of the pattern and "T" is the first letter of the text. Then we move the pattern by one position, shown in the second row.
Next we check the first letter of the pattern against the second letter of the text. It is also a mismatch. We continue this checking and moving process. In the fourth row the first letter of the pattern matches the text; then we do not move the pattern but advance to testing the next letter of the pattern. We only move the pattern by one position when we find a mismatch. In the last row, all the letters of the pattern match a consecutive run of letters of the text.
Example 2
Worst Case
• Total number of comparisons: M(N - M + 1)
• Worst case time complexity: O(MN)
Best case
Given a pattern M characters in length, and a text N characters in length...
• Best case if pattern found: Finds pattern in first M positions of text.
For example, M=5.
AAAAAAAAAAAAAAAAAAAAAAAAAAAH
AAAAA 5 comparisons made
• Total number of comparisons: M
• Best case time complexity: O(M)
Best case if pattern not found:
Always mismatch on first character. For example, M=5.
Advantages
1. A very simple technique that requires no preprocessing, so the total running time is the same as the matching time.
Disadvantages
1. A very inefficient method, because the pattern is shifted by only one position at a time.
Boyer-Moore Algorithm
If a character is compared that is not within the pattern, no match can be found by comparing any further characters at this position, so the pattern can be shifted completely past the mismatching character.
For determining the possible shifts, the B-M algorithm uses 2 preprocessing strategies simultaneously. Whenever a mismatch occurs, the algorithm computes a shift using both strategies and selects the longer one; thus it makes use of the most efficient strategy for each individual case.
NOTE : Boyer Moore algorithm starts matching from the last character of the pattern.
The 2 strategies are called the heuristics of B-M, as they are used to reduce the search. They are:
1. Bad Character Heuristic
2. Good Suffix Heuristic
Bad Character Heuristic
Case 1 - Mismatch becomes match
We look up the position of the last occurrence of the mismatching character in the pattern; if the mismatching character exists in the pattern, we shift the pattern so that it gets aligned with the mismatching character in the text T.
case 1
Explanation: In the above example, we get a mismatch at position 3. Here the mismatching character is “A”. We search for the last occurrence of “A” in the pattern and find it at position 1 (displayed in blue). Now we shift the pattern 2 positions so that the “A” in the pattern gets aligned with the “A” in the text.
Case 2
Explanation: Here we have a mismatch at position 7. The mismatching character “C” does not exist in the pattern before position 7, so we shift the pattern past position 7, and in the above example we then get a perfect match of the pattern (displayed in green). We do this because “C” does not occur in the pattern, so every shift before position 7 would give a mismatch and the search would be fruitless.
Problem in Bad Character Heuristic
In some cases the Bad Character Heuristic produces a negative (backward) shift.
For example:
This means we need some extra information to produce a shift on encountering a bad character. That information is the last position of every character in the pattern, and also the set of characters used in the pattern.
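The bad character shift can be sketched as follows, a minimal Python version that uses only this heuristic (the full Boyer-Moore also applies the good suffix rule, and the shift taken after a full match is simplified here). The `max(1, ...)` guard is exactly the protection against negative shifts described above:

```python
def bm_bad_char_search(pat, txt):
    """Boyer-Moore search using only the bad character heuristic."""
    last = {c: i for i, c in enumerate(pat)}   # last occurrence of each char
    n, m = len(txt), len(pat)
    hits, s = [], 0                            # s = current shift of pat over txt
    while s <= n - m:
        j = m - 1
        while j >= 0 and pat[j] == txt[s + j]:
            j -= 1                             # match from the last character backwards
        if j < 0:
            hits.append(s)                     # full match at shift s
            s += 1                             # simplified shift after a match
        else:
            # align the last occurrence of the bad character txt[s + j] with it;
            # max(1, ...) prevents the negative shift the heuristic can suggest
            s += max(1, j - last.get(txt[s + j], -1))
    return hits
```

Note how matching starts from the last character of the pattern, and how a character absent from the pattern (`last.get(..., -1)`) lets the pattern jump completely past the mismatch position.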
Good Suffix Heuristic
Let t be the substring of the text T that matched a suffix of the pattern P before the mismatch. There are three cases:
1) Another occurrence of t in P, matched with t in T
2) A prefix of P, which matches with a suffix of t
3) P moves past t
Explanation: In the above example, a substring t of the text T is matched with the pattern P (in green) before the mismatch at index 2. We search for another occurrence of t (“AB”) in P and find one starting at position 1 (in yellow), so we right-shift the pattern 2 positions to align the t in P with the t in T. This is the weak rule of the original Boyer-Moore.
Explanation: In the above example, t (“BAB”) is matched with P (in green) at indices 2-4 before the mismatch. Because there is no other occurrence of t in P, we search for a prefix of P which matches some suffix of t. We find the prefix “AB” (in yellow) starting at index 0, which matches not the whole t but its suffix “AB” starting at index 3. So we shift the pattern 3 positions to align this prefix with the suffix.
Explanation: In the above example, there is no occurrence of t (“AB”) in P and there is also no prefix of P which matches a suffix of t. In that case we can never find a perfect match before index 4, so we shift P past t, i.e. to index 5.
Suppose the substring q = P[i..n] matched with t in T and c = P[i-1] is the mismatching character. Unlike case 1, we now search for an occurrence of t in P that is not preceded by the character c. The closest such occurrence is then aligned with t in T by shifting the pattern P. For example:
1) Preprocessing for Strong Good Suffix
Before discussing preprocessing, let us first discuss the idea of a border. A border is a substring which is both a proper suffix and a proper prefix. For example, in the string “ccacc”, “c” is a border and “cc” is a border, because each appears at both ends of the string, but “cca” is not a border.
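The definition of a border can be checked directly. A tiny Python sketch (`is_border` is a hypothetical helper written here to illustrate the definition, not part of the algorithm's preprocessing):

```python
def is_border(b, s):
    """A border of s is both a proper prefix and a proper suffix of s."""
    return 0 < len(b) < len(s) and s.startswith(b) and s.endswith(b)

# For "ccacc": "c" and "cc" are borders, "cca" is not.
print([b for b in ["c", "cc", "cca"] if is_border(b, "ccacc")])   # ['c', 'cc']
```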
The Boyer-Moore algorithm takes O(mn) time in the worst case and O(n log(m)/m) on average, which is sublinear in the sense that not all characters of the text are inspected.
Applications
The algorithm is highly useful in tasks like recursively searching files for virus patterns, searching databases for keys or data, text and word processing, and any other task that requires handling large amounts of data at very high speed.
Bad cases for the Naive algorithm:
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
Knuth-Morris-Pratt (KMP) Algorithm
The KMP algorithm is one of the most popular pattern matching algorithms. KMP stands for Knuth-Morris-Pratt. The algorithm was invented by Donald Knuth and Vaughan Pratt together, and independently by James H. Morris, in 1970. In 1977, all three jointly published the KMP algorithm.
The KMP algorithm was the first linear-time algorithm for string matching.
The KMP algorithm is used to find a "Pattern" in a "Text". It compares character by character from left to right, but whenever a mismatch occurs it uses a preprocessed table called the "Prefix Table" to skip character comparisons while matching. The prefix table is also known as the LPS table, where LPS stands for "Longest proper Prefix which is also a Suffix".
How to use the LPS Table
We use the LPS table to decide how many characters can be skipped in the comparison when a mismatch has occurred.
When a mismatch occurs, check the LPS value of the character just before the mismatched character in the pattern. If it is 0, start comparing the first character of the pattern with the character following the mismatched character in the text. If it is not 0, continue by comparing the pattern character whose index equals that LPS value against the mismatched character in the text.
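The LPS table construction and the skipping rule just described can be sketched in Python (a minimal sketch; function names are choices made here):

```python
def build_lps(pat):
    """lps[i] = length of the longest proper prefix of pat[:i+1]
    that is also a suffix of it."""
    lps = [0] * len(pat)
    length = 0                          # length of the previous longest prefix-suffix
    i = 1
    while i < len(pat):
        if pat[i] == pat[length]:
            length += 1
            lps[i] = length
            i += 1
        elif length:
            length = lps[length - 1]    # fall back in the table, do not advance i
        else:
            lps[i] = 0
            i += 1
    return lps

def kmp_search(pat, txt):
    """Find all occurrences of pat in txt in O(n + m) time."""
    lps = build_lps(pat)
    hits, j = [], 0                     # j = length of the matched pattern prefix
    for i, c in enumerate(txt):
        while j and c != pat[j]:
            j = lps[j - 1]              # mismatch: skip comparisons via the LPS table
        if c == pat[j]:
            j += 1
        if j == len(pat):
            hits.append(i - j + 1)      # full match ending at position i
            j = lps[j - 1]              # keep going to find overlapping matches
    return hits
```

Note that the text index `i` never moves backwards; only the matched-prefix length `j` falls back through the LPS table, which is what makes the running time linear.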
EXAMPLE 1
Example 2
KMP ALGORITHM COMPLEXITY
The KMP algorithm runs in O(m + n) time: O(m) to build the LPS table for the pattern and O(n) to scan the text.
TRIES DATA STRUCTURE
A trie is an efficient information reTrieval data structure. The term trie comes from the word retrieval.
Definition of a Trie
A trie is a data structure for representing a collection of strings.
In computer science, a trie is also called a digital tree, radix tree, or prefix tree.
Tries support fast string matching.
Properties of Tries
EXAMPLE
Trie (Insert and Search)
A trie is an efficient information retrieval data structure. Using a trie, search complexity can be brought to an optimal limit (the key length).
Given multiple strings, the task is to insert the strings into a trie.
Examples:
root
/ \
c t
| |
a h
|\ |
l t e
| | \
l i r
|\ | |
e i r e
| |
r n
root
/ |\
l n t
| |
l d
|\ |
e iy
| |
r n
Approach: An efficient approach is to treat every character of the input key as an individual trie
node and insert it into the trie. Note that the children are an array of pointers (or references) to
next level trie nodes. The key character acts as an index into the array of children. If the input
key is new or an extension of the existing key, we need to construct non-existing nodes of the
key, and mark end of the word for the last node. If the input key is a prefix of the existing key in
Trie, we simply mark the last node of the key as the end of a word. The key length determines
Trie depth.
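The approach above can be sketched as follows (a minimal Python sketch; a dict of children per node is used here instead of a fixed-size pointer array indexed by character, but the idea is the same):

```python
class TrieNode:
    def __init__(self):
        self.children = {}              # character -> next-level TrieNode
        self.is_end = False             # marks the end of a stored word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:                  # each character indexes into children
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True              # mark end of word for the last node

    def search(self, key):
        node = self.root
        for ch in key:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_end              # a prefix alone is not a stored word
```

For the strings {a, an, and, any} used later in this unit, `search("an")` returns True while `search("ant")` returns False, since "ant" was never inserted.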
Trie Deletion
During the delete operation we delete the key in a bottom-up manner using recursion. The following are the possible conditions when deleting a key from a trie:
1. The key may not be in the trie. The delete operation should not modify the trie.
2. The key is present as a unique key (no part of the key is shared as a prefix with another key, nor is the key itself a prefix of another key in the trie). Delete all its nodes.
3. The key is a prefix of another, longer key in the trie. Unmark the leaf node.
4. The key is present in the trie and has at least one other key as a prefix. Delete nodes from the end of the key up to the first node of the longest prefix key.
Time Complexity: The time complexity of the deletion operation is O(n), where n is the key length.
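The four conditions above can be handled by one bottom-up recursive function. A minimal Python sketch, using nested dicts as trie nodes with a "$" end-of-word marker (both are representation choices made here for brevity):

```python
def insert(root, key):
    """Insert key into a nested-dict trie; "$" marks the end of a word."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node["$"] = True

def delete(root, key, depth=0):
    """Delete key bottom-up; returns True if this node became empty."""
    if depth == len(key):
        if "$" not in root:
            return False                # condition 1: key absent, no change
        del root["$"]                   # condition 3: unmark the word end
        return not root                 # empty => parent may prune this node
    ch = key[depth]
    if ch in root and delete(root[ch], key, depth + 1):
        del root[ch]                    # conditions 2 and 4: prune emptied nodes
        return not root
    return False
```

Deleting "the" while "there" is stored only unmarks the shared node; deleting a unique key removes all of its nodes on the way back up the recursion.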
A trie is a tree that stores strings. The maximum number of children of a node is equal to the size of the alphabet. A trie supports search, insert and delete operations in O(L) time, where L is the length of the key.
Hashing:- In hashing, we convert the key to a small value and the value is used to index
data. Hashing supports search, insert and delete operations in O(L) time on average.
Self Balancing BST : The time complexity of the search, insert and delete operations in a
self-balancing Binary Search Tree (BST) (like Red-Black Tree, AVL Tree, Splay Tree, etc) is O(L
* Log n) where n is total number words and L is the length of the word. The advantage of
Self-balancing BSTs is that they maintain order which makes operations like minimum,
maximum, closest (floor or ceiling) and kth largest faster.
Why Trie? :-
1. With a trie, we can insert and find strings in O(L) time, where L represents the length of a single word. This is obviously faster than a BST. It is also faster than hashing because of the way it is implemented: we do not need to compute any hash function, and no collision handling is required (as in open addressing and separate chaining).
2. Another advantage of a trie is that we can easily print all words in alphabetical order, which is not easily possible with hashing.
APPLICATIONS OF TRIES
String handling and processing is one of the most important topics for programmers, and many real-time applications are based on string processing. The data structure that is very important for string handling is the trie, which is based on the prefixes of strings.
TYPES OF TRIES
1. Standard Tries
2. Compressed Tries
3. Suffix Tries
STANDARD TRIES
Strings={ a,an,and,any}
Example of Standard Trie
Handling Keys(strings)
Standard Trie Searching
II. Is not part of any other node
EXAMPLE
COMPRESSED TRIE
5. It involves grouping, re-grouping and un-grouping of keys of characters.
6. While performing the insertion operation, it may be required to un-group already grouped characters.
7. While performing the deletion operation, it may be required to re-group already grouped characters.
Storage of a Compressed Trie
A compressed trie can be stored in O(s) space, where s = |S|, by using O(1)-space index ranges at the nodes.
SUFFIX TRIES
A suffix trie has the following properties:
1. A suffix trie is a compressed trie for all the suffixes of the text.
2. Suffix tries are a space-efficient data structure to store a string that allows many kinds of queries to be answered quickly.
Example
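Property 1 can be illustrated by inserting every suffix of a text into an ordinary trie. This is a naive Python sketch: a real suffix trie/tree compresses single-child chains and can be built in linear time, which this O(n²) version does not attempt. Any substring query then reduces to a walk from the root, because every substring of the text is a prefix of some suffix:

```python
def build_suffix_trie(text):
    """Naive suffix trie: a nested dict holding every suffix of text."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:             # walk/extend the path for suffix text[i:]
            node = node.setdefault(ch, {})
    return root

def is_substring(root, pat):
    """pat occurs in the text iff pat is a path from the root."""
    node = root
    for ch in pat:
        if ch not in node:
            return False
        node = node[ch]
    return True

t = build_suffix_trie("banana")
print(is_substring(t, "nan"), is_substring(t, "nab"))   # True False
```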
Advantages of suffix tries