String Matching and Hashing

The document discusses string algorithms, focusing on string hashing and the prefix function. String hashing allows for efficient string comparison by converting strings into hash values, while the prefix function helps identify the longest proper prefix which is also a suffix. Applications of these algorithms include the Rabin-Karp algorithm for string matching and the KMP algorithm for finding substring occurrences.

Uploaded by

nahin.genmorphics

String Algorithms

1) String Hashing

Comparing two strings efficiently

● The brute-force approach compares the two strings letter by letter, which takes
O(min(n1,n2)) time, where n1 and n2 are the lengths of the two strings.
● The idea behind string hashing is to convert each string into an integer and
compare the integers instead of the strings. Once the hash values are known,
comparing two strings is an O(1) operation. The conversion requires a so-called
hash function, which produces the corresponding hash value.
● The hash value of a string is a number calculated from the characters of the
string. If two strings are equal, their hash values are also equal, which makes
it possible to compare strings through their hash values. The converse does not
hold: two different strings may share a hash value. The function used to
calculate the hash value of a string is called a hash function.

Calculation of the hash of a string

● For a string S = s1,s2,s3,......,sn, we want to assign a number, ideally
unique, that can be calculated from the characters of S (in general a is
assigned 1, b is assigned 2, ....., z is assigned 26). We could store, say, the
sum of the characters in S, but two different strings may then evaluate to the
same hash. E.g. hash("abc") = 1+2+3 = 6 and hash("aad") = 1+1+4 = 6.
● So, we have to define intelligent hashes, so as to minimise collisions (a
collision occurs if two different strings evaluate to the same hash).
● A usual way to implement string hashing is the polynomial rolling hash function:
$$\mathrm{hash}(s) = \left(s[0] + s[1]\cdot p + s[2]\cdot p^{2} + s[3]\cdot p^{3} + \dots + s[n-1]\cdot p^{n-1}\right) \bmod m = \sum_{i=0}^{n-1} s[i]\cdot p^{i} \bmod m$$

where n is the length of the string, m and p are some chosen, positive numbers.
● It is reasonable to make p a prime number roughly equal to the number of
characters in the input alphabet. For example, if the input is composed of only
lowercase letters of the English alphabet, p=31 is a good choice.
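CODE
#include <string>
using namespace std;

long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;  // p^i mod m for the current position i
    for (char c : s) {
        // letters are mapped to 1..26 (a = 1, ..., z = 26)
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}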
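
The collision between "abc" and "aad" mentioned earlier can be checked directly. The following is only a small sketch: the names `sum_hash` and `poly_hash` are chosen here for illustration, and lowercase input is assumed (and enforced with an assertion).

```cpp
#include <cassert>
#include <string>

// Naive hash: the sum of the letter values (a = 1, ..., z = 26).
long long sum_hash(const std::string& s) {
    long long total = 0;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
        total += c - 'a' + 1;
    }
    return total;
}

// Polynomial rolling hash with p = 31 and m = 1e9 + 9, as in the text.
long long poly_hash(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    long long value = 0, p_pow = 1;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');
        value = (value + (c - 'a' + 1) * p_pow) % m;
        p_pow = p_pow * p % m;
    }
    return value;
}
```

With these definitions, `sum_hash("abc")` and `sum_hash("aad")` are both 6, while `poly_hash("abc")` is 1 + 2·31 + 3·31² = 2946 and `poly_hash("aad")` is 1 + 1·31 + 4·31² = 3876, so the polynomial hash separates the two strings.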

Reducing the chance of collision

● m should be chosen so that m*m does not overflow the integer type used (a
64-bit long long in C/C++). A good choice for m is a large prime number such as
10^9+9.
● To reduce the collision probability further, we can compute two different
hashes for each string (using two different values of p and/or m) and compare
the pairs instead.
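
A minimal sketch of the two-hash idea follows. The second pair of parameters (p = 37, m = 10^9+7) is an assumption made for this example, not a choice taken from the text, and `poly_hash` and `double_hash` are illustrative names.

```cpp
#include <cassert>
#include <string>
#include <utility>

// One polynomial hash with a caller-chosen base p and modulus m.
long long poly_hash(const std::string& s, long long p, long long m) {
    long long value = 0, p_pow = 1;
    for (char c : s) {
        assert(c >= 'a' && c <= 'z');  // assumption: lowercase letters only
        value = (value + (c - 'a' + 1) * p_pow) % m;
        p_pow = p_pow * p % m;
    }
    return value;
}

// Pair of hashes with different parameters. Two strings are treated as
// equal only if both components agree.
std::pair<long long, long long> double_hash(const std::string& s) {
    return {poly_hash(s, 31, 1000000009LL),   // parameters used in the text
            poly_hash(s, 37, 1000000007LL)};  // assumed second parameter set
}
```

Comparing the pairs with `==` compares both components, so a false match now requires a collision under both parameter sets at once.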

Applications of hashing
1. Calculating the hash value of any substring in O(1) after preprocessing-
The idea is to construct an array hash such that hash[k] contains the hash value
of the prefix s[0...k]. By definition,
$$\mathrm{hash}(s[i \dots j]) = \left(s[i] + s[i+1]\cdot p + s[i+2]\cdot p^{2} + \dots + s[j]\cdot p^{j-i}\right) \bmod m = \sum_{k=i}^{j} s[k]\cdot p^{k-i} \bmod m$$

To compute the hash of a substring from the prefix hashes using this formula, we
must be able to divide hash(s[0…j])−hash(s[0…i−1]) by p^i. Division is not
directly available modulo m, so we find the modular multiplicative inverse of
p^i and multiply by this inverse instead. All the inverses can be precomputed,
which allows the hash of any substring to be computed in O(1) time.

$$\mathrm{hash}(s[i \dots j])\cdot p^{i} = \sum_{k=i}^{j} s[k]\cdot p^{k} \bmod m = \left(\mathrm{hash}(s[0 \dots j]) - \mathrm{hash}(s[0 \dots i-1])\right) \bmod m$$
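
The inverse-based computation described above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: it relies on m = 10^9+9 being prime so that the inverse of p can be obtained as p^(m−2) by Fermat's little theorem, it stores prefix hashes shifted by one (`prefix[k]` is the hash of s[0…k−1]) to avoid a special case for i = 0, and the names `mod_pow` and `SubstringHasher` are made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

const long long P = 31;
const long long M = 1000000009LL;  // prime, so Fermat's little theorem applies

// Fast exponentiation: base^exp mod M.
long long mod_pow(long long base, long long exp) {
    long long result = 1;
    base %= M;
    while (exp > 0) {
        if (exp & 1) result = result * base % M;
        base = base * base % M;
        exp >>= 1;
    }
    return result;
}

struct SubstringHasher {
    std::vector<long long> prefix;   // prefix[k] = hash of s[0..k-1]
    std::vector<long long> inv_pow;  // inv_pow[i] = inverse of P^i modulo M

    explicit SubstringHasher(const std::string& s)
        : prefix(s.size() + 1, 0), inv_pow(s.size() + 1, 1) {
        const long long inv_p = mod_pow(P, M - 2);
        long long p_pow = 1;
        for (std::size_t i = 0; i < s.size(); i++) {
            prefix[i + 1] = (prefix[i] + (s[i] - 'a' + 1) * p_pow) % M;
            p_pow = p_pow * P % M;
            inv_pow[i + 1] = inv_pow[i] * inv_p % M;
        }
    }

    // Exact hash of s[i..j] (both ends inclusive) in O(1).
    long long hash(std::size_t i, std::size_t j) const {
        assert(i <= j && j + 1 < prefix.size());
        long long diff = (prefix[j + 1] - prefix[i] + M) % M;  // hash * P^i
        return diff * inv_pow[i] % M;                          // divide by P^i
    }
};
```

For the string "abcabc", both `hash(0, 2)` and `hash(3, 5)` give 2946, the same value as hashing "abc" on its own.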

Rather than calculating the hash of a substring exactly, it is enough to compute
the hash multiplied by some power of p, since we only care about whether two
substrings match and not about the exact hash value. Suppose we have the hashes
of two substrings, one multiplied by p^i and the other by p^j. If i<j then we
multiply the first hash by p^(j−i); otherwise we multiply the second hash by
p^(i−j). Both hashes are then multiplied by the same power of p and can be
compared directly.
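
As a sketch of this inverse-free comparison (the struct name `SubstringComparer` and its method `same` are hypothetical, and the prefix array is again shifted by one so that `prefix[k]` is the hash of s[0…k−1]):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

const long long P = 31;
const long long M = 1000000009LL;

struct SubstringComparer {
    std::vector<long long> prefix;  // prefix[k] = hash of s[0..k-1]
    std::vector<long long> p_pow;   // p_pow[k] = P^k mod M

    explicit SubstringComparer(const std::string& s)
        : prefix(s.size() + 1, 0), p_pow(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); i++) {
            prefix[i + 1] = (prefix[i] + (s[i] - 'a' + 1) * p_pow[i]) % M;
            p_pow[i + 1] = p_pow[i] * P % M;
        }
    }

    // True if the hashes of s[a..a+len-1] and s[b..b+len-1] match.
    bool same(std::size_t a, std::size_t b, std::size_t len) const {
        assert(a + len < prefix.size() && b + len < prefix.size());
        if (a > b) std::swap(a, b);                            // ensure a <= b
        long long ha = (prefix[a + len] - prefix[a] + M) % M;  // hash * P^a
        long long hb = (prefix[b + len] - prefix[b] + M) % M;  // hash * P^b
        return ha * p_pow[b - a] % M == hb;  // bring both to the factor P^b
    }
};
```

No modular inverse is needed here: the hash with the smaller power of p is simply scaled up to match the other one.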

2. Rabin-Karp algorithm for string matching-

Problem: Given two strings, a pattern s and a text t, determine whether the
pattern appears in the text and, if it does, enumerate all its occurrences in
O(|s|+|t|) time.

Algorithm- Calculate the hash value of the pattern s and the hash values of all
prefixes of the text t. Then compare every substring of t of length |s| with s
by the method described above; each comparison takes O(1). Calculating the hash
of the pattern takes O(|s|), and computing the prefix hashes of t and comparing
each substring of length |s| with the pattern takes O(|t|), hence the overall
complexity is O(|s|+|t|).

CODE:
#include <algorithm>
#include <string>
#include <vector>
using namespace std;

vector<int> rabin_karp(string const& s, string const& t) {
    const int p = 31;
    const int m = 1e9 + 9;
    int S = s.size(), T = t.size();

    // p_pow[i] = p^i mod m (the +1 keeps p_pow[0] valid for empty inputs)
    vector<long long> p_pow(max(S, T) + 1);
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the prefix t[0..i-1]
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;

    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        // cur_h is the hash of t[i..i+S-1] multiplied by p^i
        long long cur_h = (h[i+S] + m - h[i]) % m;
        if (cur_h == h_s * p_pow[i] % m)
            occurrences.push_back(i);
    }
    return occurrences;
}

3. Calculating the number of different substrings of a string.

4. Calculating the number of palindromic substrings in a string.
5. Search for duplicate strings in an array of strings.
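
As one illustration of these applications, the number of different substrings can be counted by collecting, for each length, the hashes of all substrings of that length into a set. The following is a sketch only: the function name is made up, lowercase input is assumed, and each hash is normalised to the common factor p^(n−1) so that equal substrings at different positions produce equal values. The expected running time is O(n^2), and the result is exact only if no collisions occur.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

int count_distinct_substrings(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    const int n = (int)s.size();
    std::vector<long long> p_pow(n + 1, 1), h(n + 1, 0);
    for (int i = 0; i < n; i++) {
        assert(s[i] >= 'a' && s[i] <= 'z');  // assumption: lowercase only
        h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;
        p_pow[i + 1] = p_pow[i] * p % m;
    }
    int count = 0;
    for (int len = 1; len <= n; len++) {
        std::unordered_set<long long> seen;  // hashes of substrings of this length
        for (int i = 0; i + len <= n; i++) {
            long long cur = (h[i + len] - h[i] + m) % m;  // hash * p^i
            cur = cur * p_pow[n - i - 1] % m;             // now hash * p^(n-1)
            seen.insert(cur);
        }
        count += (int)seen.size();
    }
    return count;
}
```

For example, "abab" has the seven different substrings a, b, ab, ba, aba, bab and abab.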

NOTE:- Hashing is not guaranteed to be correct, because two completely different
strings might have the same hash (the hashes collide), and a hashing solution can
be hacked with inputs constructed to collide. So, it is advisable to minimise the
use of hashing and to apply the other algorithms discussed later as much as you can.
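
One common way to guard against collisions is to confirm every hash match with a direct character comparison. The variant below is a sketch of that idea rather than part of the original algorithm description; the names `rabin_karp_verified` and `letter_value` are hypothetical. Note that each confirmed match costs an extra O(|s|), so the worst case grows to O(|t|·|s|) when matches are very frequent.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Maps 'a'..'z' to 1..26; lowercase input is an assumption of this sketch.
static int letter_value(char c) {
    assert(c >= 'a' && c <= 'z');
    return c - 'a' + 1;
}

// Rabin-Karp in which every hash match is confirmed by comparing characters.
std::vector<int> rabin_karp_verified(const std::string& s, const std::string& t) {
    const long long p = 31, m = 1000000009LL;
    const int S = (int)s.size(), T = (int)t.size();
    std::vector<int> occurrences;
    if (S == 0 || S > T) return occurrences;

    std::vector<long long> p_pow(T + 1, 1), h(T + 1, 0);
    for (int i = 0; i < T; i++) {
        h[i + 1] = (h[i] + letter_value(t[i]) * p_pow[i]) % m;
        p_pow[i + 1] = p_pow[i] * p % m;
    }
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + letter_value(s[i]) * p_pow[i]) % m;

    for (int i = 0; i + S <= T; i++) {
        long long cur_h = (h[i + S] - h[i] + m) % m;  // hash * p^i
        // The hash acts as a cheap filter; compare() removes false positives.
        if (cur_h == h_s * p_pow[i] % m && t.compare(i, S, s) == 0)
            occurrences.push_back(i);
    }
    return occurrences;
}
```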

2) Prefix Function

Introduction

You are given a string s of length n. The prefix function for this string is defined
as an array π of length n, where π[i] is the length of the longest proper prefix of
the substring s[0…i] which is also a suffix of this substring. A proper prefix of a
string is a prefix that is not equal to the string itself. By definition, π[0]=0,
because the only proper prefix of a string of length 1 is the empty string.

For example, the prefix function of the string "abcabcd" is [0,0,0,1,2,3,0], and the
prefix function of the string "aabaaab" is [0,1,0,1,2,2,3].

Naive Algorithm - O(n^3):

Iterate over the string and calculate the prefix value for each index. For each
of the O(n) positions i, try all O(n) candidate lengths k and check whether the
prefix of length k is equal to the suffix of length k ending at i; each
comparison takes O(n) time.

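#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 0; i < n; i++)
        for (int k = 0; k <= i; k++)  // k = length of a proper prefix of s[0..i]
            if (s.substr(0, k) == s.substr(i-k+1, k))
                pi[i] = k;  // larger matching k overwrite smaller ones
    return pi;
}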

Improved Algorithm - O(n^2):

Optimization 1- The first observation we make is that π[i+1]≤π[i]+1. That is,
the value of the prefix function can increase by at most one from one position
to the next.

Proof- Suppose otherwise, i.e. π[i+1]>π[i]+1. Take the suffix ending in position
i+1 with the length π[i+1] and remove the last character from it. We end up with
a suffix ending in position i with the length π[i+1]−1 that is also a prefix.
Since π[i+1]−1>π[i], this is better than π[i], i.e. we get a contradiction.
The above fact allows us to reduce the complexity of the algorithm to O(n^2):
when computing π[i+1] we only need to try lengths from π[i]+1 downwards. In one
step the prefix function can grow by at most one, so in total it can grow by at
most n and therefore can also decrease by at most n in total. This means we only
have to perform O(n) string comparisons, each taking O(n) time, and we reach the
complexity O(n^2).
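
The text gives no code for this intermediate version, so the following is only one possible sketch of it. The function name `prefix_function_quadratic` is made up; the candidate length starts at π[i−1]+1 and walks downwards until a match is found.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> prefix_function_quadratic(const std::string& s) {
    const int n = (int)s.size();
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        // pi[i] <= pi[i-1] + 1, so start from that bound and walk down.
        for (int k = pi[i - 1] + 1; k > 0; k--) {
            // Compare the prefix s[0..k-1] with the suffix s[i-k+1..i].
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;
                break;
            }
        }
    }
    return pi;
}
```

Every failed comparison lowers the candidate length by one, and the length can rise by at most one per position, which is where the O(n) bound on the number of string comparisons comes from.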

Final Algorithm - O(n):

Optimization 2-
● We first try extending the suffix of length j=π[i] by checking if s[j]=s[i+1]. If
they are equal, then π[i+1]=j+1 and we are done.
● Otherwise, we need to find the next-longest j that still works as an equal
prefix/suffix for i, that is, the next-longest length with s[0…j−1]=s[i−j+1…i],
and try to extend that instead. If that j doesn't work either, we find the
next-longest j again, and so on. This can continue until j=0; in that case we
assign π[i+1]=1 if s[i+1]=s[0], and π[i+1]=0 otherwise.
● The only question left is how to find these lengths j efficiently. We are
looking for the largest k<j such that k is also a valid prefix/suffix length for
position i. This has to be the value of π[j−1], which we already calculated earlier.
Proof: Consider the case j=π[i]=4, and let string1 = s0s1s2s3 and
string2 = si-3si-2si-1si. By the definition of the prefix function,
string1 = string2. Now k is the largest value <j such that the suffix of string2
of length k equals the prefix of the whole string of length k (which is also the
prefix of string1, since k< length of string1). Also, the suffix of string2
equals the suffix of string1 of the same length (since both strings are equal),
which implies that the suffix of string1 of length k equals the prefix of
string1 of length k. The largest k for which this equality holds is, by
definition, π[3], i.e. π[j−1]. So, k = π[j−1].
● This means the algorithm comes out to be really simple: set j to π[i], then
while j>0 and the current j doesn't work, keep setting j to π[j−1]. We still
perform at most O(n) comparisons in total, but each one only compares two
characters, so it is now O(1) per comparison. The whole prefix function is
therefore computed in O(n).

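CODE

#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 1; i < n; i++) {
        int j = pi[i-1];                // longest prefix/suffix length for i-1
        while (j > 0 && s[i] != s[j])
            j = pi[j-1];                // fall back to the next-longest length
        if (s[i] == s[j])
            j++;
        pi[i] = j;
    }
    return pi;
}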

Applications of prefix function

1. KMP Algorithm-
Problem: Given a text t and a string s, we want to find and display the
positions of all occurrences of the string s in the text t.
Algorithm: We generate the string s+#+t, where # is a separator that appears
neither in s nor in t, and compute the prefix function for this string. Let n be
the length of the string s. The value of the prefix function will never exceed n,
because a longer prefix would have to contain the separator, which occurs nowhere
else in the string. If the equality π[i]=n is achieved at some position i, then
the string s appears completely ending at position i of the combined string, i.e.
we have found a match that ends at i. Position i corresponds to index i−(n+1) in
t, so the occurrence of s starts at index i−(n+1)−n+1=i−2n in the text t.

In total, this is O(∣s∣+∣t∣) in both time and memory. However, you can optimize it
to O(∣s∣) memory by not explicitly storing the prefix values after the separator,
as they are not needed. The implementation below does not do this.

CODE:

vector<int> kmp_algo(string text, string s) {
    int n = s.length(), m = text.length();
    string str = s + "#" + text;
    vector<int> pi = prefix_function(str), ans;
    for (int i = n + 1; i <= n + m; i++) {  /* n + 1 is where the text starts */
        if (pi[i] == n)
            ans.push_back(i - 2 * n);       /* i - (n - 1) - (n + 1) */
    }
    return ans;
}

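
The O(∣s∣)-memory variant mentioned above can be sketched as follows. This is one possible way to do it, not the author's implementation: only the prefix function of the pattern is stored, and the text is scanned while keeping a single running value j. The function name `kmp_low_memory` is hypothetical; the argument order follows `kmp_algo`.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> kmp_low_memory(const std::string& text, const std::string& s) {
    const int n = (int)s.size();
    std::vector<int> ans;
    if (n == 0) return ans;

    // Prefix function of the pattern only: O(|s|) memory.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }

    // Scan the text, keeping only the current prefix-function value j.
    int j = 0;
    for (int i = 0; i < (int)text.size(); i++) {
        // After a full match the separator would mismatch, forcing a fallback.
        if (j == n) j = pi[n - 1];
        while (j > 0 && text[i] != s[j]) j = pi[j - 1];
        if (text[i] == s[j]) j++;
        if (j == n) ans.push_back(i - n + 1);  // match starts at i - n + 1
    }
    return ans;
}
```

Because the separator never matches any character, its only effect in the combined string is the fallback to π[n−1] after a complete match, which the first line of the scanning loop reproduces without building s+#+t.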
2. The number of different substrings in a string
3. Compressing a string
4. Counting the number of occurrences of each prefix
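
As an illustration of the string-compression application, the sketch below finds the length of the shortest block whose repetition gives the whole string. It relies on a standard property of the prefix function that is not derived in this text: with k = n − π[n−1], the string is a repetition of its first k characters exactly when k divides n. The function name `shortest_period_length` is made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Length of the shortest block t such that s is t repeated a whole number
// of times; returns s.size() if s cannot be compressed this way.
int shortest_period_length(const std::string& s) {
    const int n = (int)s.size();
    if (n == 0) return 0;

    // Linear-time prefix function, as in the final algorithm above.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }

    const int k = n - pi[n - 1];    // candidate block length
    return (n % k == 0) ? k : n;    // valid only if it divides n
}
```

For example, "abcabcabc" compresses to the block "abc" (length 3), while "abcab" has k = 3, which does not divide 5, so it cannot be compressed.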

Practice Problems:
http://www.spoj.com/problems/NAJPF/

http://codeforces.com/problemset/problem/271/D

https://codeforces.com/problemset/problem/835/D

http://codeforces.com/contest/808/problem/G

http://www.spoj.com/problems/SUFEQPRE/
