
String Matching and Hashing

The document discusses string algorithms, focusing on string hashing and the prefix function. String hashing allows for efficient string comparison by converting strings into hash values, while the prefix function helps identify the longest proper prefix which is also a suffix. Applications of these algorithms include the Rabin-Karp algorithm for string matching and the KMP algorithm for finding substring occurrences.

String Algorithms

1) String Hashing

Comparing two strings efficiently

● The brute-force way is to compare the two strings letter by letter, which
takes O(min(n1,n2)) time, where n1 and n2 are the lengths of the two strings.
● The idea behind string hashing is the following: we convert each string into an
integer and compare those integers instead of the strings. Comparing two hash
values is then an O(1) operation. For the conversion, we need a so-called hash
function; the integer it produces is the hash value.
● A hash value of a string is a number that is calculated from the characters of the
string. If two strings are the same, their hash values are also the same, so
different hash values prove that the strings differ. Equal hash values only
indicate that the strings are very likely equal, which is what makes it possible
to compare strings based on their hash values. The function which is used to
calculate the hash value of the string is called a hash function.

Calculation of the hash of a string

● For a string S = s[0], s[1], ..., s[n-1], we want to assign to it a number,
ideally unique, which can be calculated from the characters stored in S. In
general a is assigned 1, b is assigned 2, ..., z is assigned 26 (starting from 1
rather than 0, so that strings such as "a" and "aa" do not all hash to 0). We
could store, say, the sum of the characters in S, but two different strings may
then evaluate to the same hash.
E.g. hash("abc") = 1+2+3 = 6 and hash("aad") = 1+1+4 = 6.
● So, we have to define hashes intelligently, so as to minimise collisions (a
collision occurs if two different strings evaluate to the same hash).
● A usual way to implement string hashing is the polynomial rolling hash function:

$$\mathrm{hash}(s) = \left(s[0] + s[1]\cdot p + s[2]\cdot p^2 + s[3]\cdot p^3 + \dots + s[n-1]\cdot p^{n-1}\right) \bmod m = \left(\sum_{i=0}^{n-1} s[i]\cdot p^i\right) \bmod m$$

where n is the length of the string, and m and p are some chosen positive numbers.
● It is reasonable to make p a prime number roughly equal to the number of
characters in the input alphabet. For example, if the input is composed of only
lowercase letters of the English alphabet, p=31 is a good choice.
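CODE
#include <string>
using namespace std;

// Polynomial rolling hash of a lowercase string (a = 1, ..., z = 26).
long long compute_hash(string const& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0;
    long long p_pow = 1;  // p^i mod m for the current position i
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}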
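To see why the polynomial hash is preferred over a plain sum of letter values, the following sketch evaluates both on the two strings from the collision example above. The names `sum_hash` and `poly_hash` are illustrative and not part of the original notes; `poly_hash` assumes lowercase input and the same p = 31 and m = 10^9 + 9 as the code above.

```cpp
#include <string>

// Naive hash: sum of letter values (a = 1, ..., z = 26). Illustrative only.
long long sum_hash(const std::string& s) {
    long long total = 0;
    for (char c : s) total += c - 'a' + 1;
    return total;
}

// Polynomial rolling hash with p = 31 and m = 1e9 + 9 (lowercase input assumed).
long long poly_hash(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    long long hash_value = 0, p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = p_pow * p % m;
    }
    return hash_value;
}
```

Here sum_hash gives 6 for both "abc" and "aad", while poly_hash gives 1 + 2*31 + 3*961 = 2946 for "abc" and 1 + 1*31 + 4*961 = 3876 for "aad".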

Reducing the chance of collision


● We should choose m such that m*m doesn't overflow the 64-bit integer type
(long long in C/C++), because intermediate products can be almost as large as
m*m. A good choice for m is some large prime number like 10^9 + 9.
● To reduce the chance of collision further, we can compute two different
hashes for each string (by using two different p and/or two different m) and
compare these pairs instead.
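A minimal sketch of the two-hash idea is given below. The function name `double_hash` and the second parameter pair (p = 37, m = 10^9 + 7) are assumptions chosen for illustration; any two reasonably independent pairs would serve.

```cpp
#include <string>
#include <utility>

// Two independent polynomial hashes of a lowercase string.
// The second pair (p = 37, m = 1e9 + 7) is an assumed, illustrative choice.
std::pair<long long, long long> double_hash(const std::string& s) {
    const long long p1 = 31, m1 = 1000000009LL;
    const long long p2 = 37, m2 = 1000000007LL;
    long long h1 = 0, h2 = 0, pw1 = 1, pw2 = 1;
    for (char c : s) {
        long long v = c - 'a' + 1;   // a = 1, ..., z = 26
        h1 = (h1 + v * pw1) % m1;
        h2 = (h2 + v * pw2) % m2;
        pw1 = pw1 * p1 % m1;
        pw2 = pw2 * p2 % m2;
    }
    return {h1, h2};
}
```

Two strings are then treated as equal only if both components of the pair agree, so a false match requires a collision under both hashes at once.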

Applications of hashing
1. Calculating the hash value of any substring in O(1) after preprocessing-
The idea is to construct an array hash such that hash[k] contains the hash value
of the prefix s[0...k]. Now,

$$\mathrm{hash}(s[i \dots j]) = \left(s[i] + s[i+1]\cdot p + s[i+2]\cdot p^2 + \dots + s[j]\cdot p^{j-i}\right) \bmod m = \left(\sum_{k=i}^{j} s[k]\cdot p^{k-i}\right) \bmod m$$

We can compute the hash of any substring directly using this formula. For this,
we must be able to divide hash(s[0…j])−hash(s[0…i−1]) by p^i. Therefore we
need to find the modular multiplicative inverse of p^i and then multiply by
this inverse. We can precompute all the inverses, which allows computing the
hash of any substring in O(1) time. (For i = 0 the subtracted term is the hash
of the empty prefix, i.e. 0.)

$$\mathrm{hash}(s[i \dots j])\cdot p^i \equiv \sum_{k=i}^{j} s[k]\cdot p^k \equiv \mathrm{hash}(s[0 \dots j]) - \mathrm{hash}(s[0 \dots i-1]) \pmod m$$


Rather than calculating the hashes of substrings exactly, it is enough to compute
the hash multiplied by some power of p, as we are only concerned with
matching two strings and not with the exact hash value. Suppose we have two
hashes of two substrings, one multiplied by p^i and the other by p^j. If i<j then
we multiply the first hash by p^(j−i); otherwise we multiply the second hash by
p^(i−j). Both hashes are then multiplied by the same power of p and can be
compared directly.
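The comparison trick described above can be sketched as follows. The struct `PrefixHashes`, its members and the constants `HASH_P` and `HASH_M` are hypothetical names introduced here. Note one deliberate difference from the text: the sketch stores h[k] as the hash of the first k characters (shifted by one position), so that h[0] = 0 covers the case i = 0 without a special branch.

```cpp
#include <string>
#include <vector>

const long long HASH_P = 31;
const long long HASH_M = 1000000009LL;

struct PrefixHashes {
    std::vector<long long> h;      // h[k] = hash of the first k characters
    std::vector<long long> p_pow;  // p_pow[k] = p^k mod m

    explicit PrefixHashes(const std::string& s)
        : h(s.size() + 1, 0), p_pow(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); i++) {
            h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % HASH_M;
            p_pow[i + 1] = p_pow[i] * HASH_P % HASH_M;
        }
    }

    // Returns hash(s[i..j]) * p^i mod m, for 0 <= i <= j < n.
    long long shifted(int i, int j) const {
        return (h[j + 1] + HASH_M - h[i]) % HASH_M;
    }

    // Compares s[i1 .. i1+len-1] with s[i2 .. i2+len-1] by hash only.
    bool same(int i1, int i2, int len) const {
        long long a = shifted(i1, i1 + len - 1);
        long long b = shifted(i2, i2 + len - 1);
        // Bring both values to the same power of p before comparing.
        if (i1 < i2) a = a * p_pow[i2 - i1] % HASH_M;
        else         b = b * p_pow[i1 - i2] % HASH_M;
        return a == b;
    }
};
```

For the string "abcabc", same(0, 3, 3) compares "abc" with "abc" and returns true, while same(0, 1, 3) compares "abc" with "bca" and returns false. No modular inverse is needed anywhere.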

2. Rabin-Karp algorithm for string matching-


Problem: Given two strings - a pattern s and a text t, determine if the
pattern appears in the text and if it does, enumerate all its occurrences in
O(|s|+|t|) time.

Algorithm- Calculate the hash value of the pattern s and the hash values of all
prefixes of the text t. Then compare every substring of t of length |s| with the
string s by the method described above. Each comparison takes O(1).
Calculating the hash of the pattern requires O(|s|), and comparing each
substring of length |s| with the pattern requires O(|t|) in total, hence the
overall complexity is O(|s|+|t|).

CODE:
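#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Returns the starting indices of all (hash-based) matches of pattern s in text t.
vector<int> rabin_karp(string const& s, string const& t) {
    const int p = 31;
    const int m = 1e9 + 9;
    int S = s.size(), T = t.size();
    if (S == 0) return {};  // guard: an empty pattern would index past p_pow

    // p_pow[i] = p^i mod m
    vector<long long> p_pow(max(S, T));
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;

    // h[i] = hash of the first i characters of t
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;

    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        // cur_h = hash(t[i..i+S-1]) * p^i, so compare against h_s * p^i
        long long cur_h = (h[i+S] + m - h[i]) % m;
        if (cur_h == h_s * p_pow[i] % m)
            occurrences.push_back(i);
    }
    return occurrences;
}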

3. Calculating the number of different substrings of a string.
4. Calculating the number of palindromic substrings in a string.
5. Searching for duplicate strings in an array of strings.
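As an illustration of application 3, here is a rough sketch that counts the different substrings by collecting, for each length, the hashes of all substrings of that length in a set. The function name `count_unique_substrings` is an assumption, the input is assumed to be lowercase, and the result is only as reliable as the hash (a collision would undercount). It runs in O(n^2 log n).

```cpp
#include <set>
#include <string>
#include <vector>

int count_unique_substrings(const std::string& s) {
    const long long p = 31, m = 1000000009LL;
    int n = (int)s.size();

    // p_pow[k] = p^k mod m, h[k] = hash of the first k characters.
    std::vector<long long> p_pow(n + 1, 1), h(n + 1, 0);
    for (int i = 0; i < n; i++) {
        p_pow[i + 1] = p_pow[i] * p % m;
        h[i + 1] = (h[i] + (s[i] - 'a' + 1) * p_pow[i]) % m;
    }

    int count = 0;
    for (int len = 1; len <= n; len++) {
        std::set<long long> seen;  // hashes of all substrings of this length
        for (int i = 0; i + len <= n; i++) {
            long long cur = (h[i + len] + m - h[i]) % m;  // hash * p^i
            cur = cur * p_pow[n - 1 - i] % m;             // hash * p^(n-1)
            seen.insert(cur);
        }
        count += (int)seen.size();
    }
    return count;
}
```

Every hash is scaled to the common factor p^(n-1) before it is inserted, which is the same "multiply by a power of p" trick used for comparing substrings. For "abab" the substrings are a, b, ab, ba, aba, bab and abab, so the function returns 7.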

NOTE:- Hashing is not 100% deterministically correct, because two completely
different strings might have the same hash (the hashes collide). A hashing solution can
therefore be hacked, i.e. broken by input built specifically to cause collisions. So, it is
advisable to minimise the use of hashing and to apply the other algorithms discussed later
as much as you can.
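One common way to remove false positives from the Rabin-Karp search is to confirm every hash match with a direct character comparison. The sketch below is a variant written for this note, not the code given above: the name `rabin_karp_verified` is an assumption, and the input is assumed to be lowercase.

```cpp
#include <string>
#include <vector>

// Rabin-Karp where each hash match is confirmed by comparing characters.
std::vector<int> rabin_karp_verified(const std::string& s, const std::string& t) {
    const long long p = 31, m = 1000000009LL;
    int S = (int)s.size(), T = (int)t.size();
    std::vector<int> occurrences;
    if (S == 0 || S > T) return occurrences;

    // p_pow[k] = p^k mod m, h[k] = hash of the first k characters of t.
    std::vector<long long> p_pow(T + 1, 1), h(T + 1, 0);
    for (int i = 0; i < T; i++) {
        p_pow[i + 1] = p_pow[i] * p % m;
        h[i + 1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;
    }
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;

    for (int i = 0; i + S <= T; i++) {
        long long cur_h = (h[i + S] + m - h[i]) % m;  // hash * p^i
        // The O(|s|) comparison runs only when the hashes agree.
        if (cur_h == h_s * p_pow[i] % m && t.compare(i, S, s) == 0)
            occurrences.push_back(i);
    }
    return occurrences;
}
```

The answer is then always correct, at a price: each hash match costs an extra O(|s|), so the O(|s|+|t|) bound no longer holds when the pattern occurs very often.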

2) Prefix Function

Introduction

You are given a string s of length n. The prefix function for this string is defined
as an array π of length n, where π[i] is the length of the longest proper prefix of
the substring s[0…i] which is also a suffix of this substring. A proper prefix of a
string is a prefix that is not equal to the string itself. By definition, π[0]=0,
because a string of length 1 has no non-empty proper prefix.

For example, the prefix function of the string "abcabcd" is [0,0,0,1,2,3,0], and the
prefix function of the string "aabaaab" is [0,1,0,1,2,2,3].

Naive Algorithm - O(n^3):

Iterate over the string and calculate the prefix value for each index. For each of
the O(n) positions i, try all O(n) candidate lengths k and check whether the
prefix of length k is equal to the suffix of length k ending at i; each
comparison takes O(n) time.

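#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 0; i < n; i++)
        // try every proper-prefix length k; the largest match is written last
        for (int k = 0; k <= i; k++)
            if (s.substr(0, k) == s.substr(i-k+1, k))
                pi[i] = k;
    return pi;
}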

Improved Algorithm - O(n^2):

Optimization 1- The first observation we make is that π[i+1]≤π[i]+1. That is,
the value of the prefix function can increase by at most one from one position
to the next.

Proof- Suppose otherwise, i.e. π[i+1]>π[i]+1. Then we can take the suffix ending in
position i+1 with the length π[i+1] and remove the last character from it. We end
up with a suffix ending in position i with the length π[i+1]−1 that is also a
prefix, which is better than π[i], i.e. we get a contradiction.
The above fact allows us to reduce the complexity of the algorithm to O(n^2),
because in one step the prefix function can grow by at most one. In total the
function can grow by at most n, and therefore it can also decrease by at most n
in total. This means we only have to perform O(n) string comparisons, each
taking O(n) time, and reach the complexity O(n^2).
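The notes give no code for this intermediate version, so here is a minimal sketch of it. The name `prefix_function_quadratic` is an assumption. For each position it starts from the upper bound π[i−1]+1 and tries shorter lengths until one matches.

```cpp
#include <string>
#include <vector>

std::vector<int> prefix_function_quadratic(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        // pi[i] <= pi[i-1] + 1, so start at that bound and go downwards.
        for (int k = pi[i - 1] + 1; k > 0; k--) {
            // Compare the prefix s[0..k-1] with the suffix s[i-k+1..i].
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;
                break;
            }
        }
    }
    return pi;
}
```

On the two examples from the introduction this returns [0,0,0,1,2,3,0] for "abcabcd" and [0,1,0,1,2,2,3] for "aabaaab".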

Final Algorithm - O(n):


Optimization 2-
● We first try extending the suffix of length j=π[i] by checking if s[j]=s[i+1]. If that
works, then π[i+1]=j+1 and we are done for this position.
● Otherwise, we need to find the next-longest possible j that still works as an equal
prefix/suffix for i, that is, the next-longest suffix length that we could "extend" from i,
i.e. s[0…j−1]=s[i−j+1…i]. And if that j doesn't work, we find the next-longest j
again, and so on. It can happen that this goes on until j=0. If then s[i+1]=s[0], we
assign π[i+1]=1, and π[i+1]=0 otherwise.

● The only question left is how to find these lengths j efficiently.
We're looking for the largest k<j such that k is also a valid prefix/suffix length. This
has to be the value of π[j−1], which we already calculated earlier.
Proof: As an illustration, take j=π[i]=4, and let string1 = s[0]s[1]s[2]s[3] and
string2 = s[i−3]s[i−2]s[i−1]s[i]. By the definition of the prefix function, string1 =
string2. Now k is the largest value <j such that the suffix of string2 of length k
equals the prefix of the whole string of length k (which is also a prefix of string1,
since k < length of string1). Also, the suffix of string2 equals the suffix of string1
of the same length (since both strings are equal), which ultimately implies that the
suffix of string1 equals the prefix of string1. We have to select the largest value
of k for which this equality holds, which is nothing other than the value π[3], i.e.
π[j−1]. So, k = π[j−1].
● This means the algorithm actually comes out to be really simple: set j to π[i], then
while j>0 and the current j doesn't work (s[j]≠s[i+1]), keep setting j to π[j−1].
Each check now compares only two characters, so it takes O(1). The total number
of checks is O(n), because j grows by at most one per position and every failed
check strictly decreases it. That means we can compute the whole prefix function
in O(n).

CODE

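#include <string>
#include <vector>
using namespace std;

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 1; i < n; i++) {
        int j = pi[i-1];
        // fall back to shorter borders until s[i] can extend one
        while (j > 0 && s[i] != s[j])
            j = pi[j-1];
        if (s[i] == s[j])
            j++;
        pi[i] = j;
    }
    return pi;
}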

Applications of prefix function


1. KMP Algorithm-
Problem: Given a text t and a string s, we want to find and display the
positions of all occurrences of the string s in the text t.
Algorithm: We generate the string s+#+t, where # is a separator that appears
neither in s nor in t, and compute the prefix function for this string. Notice that
the value of the prefix function will never exceed |s|, because otherwise some
character would have to equal the separator, which is not possible. If the equality
π[i]=n (where n is the length of the string s) holds at some position i, then the
string s appears completely in the combined string and ends at position i, i.e. we
have found a match that ends at i. The match starts at position i−n+1 of the
combined string, and the text t starts at offset n+1, so the string s appears at
position (i−n+1)−(n+1)=i−2n in the text t.

In total, this is O(|s|+|t|) in both time and memory. However, you can optimize it
to O(|s|) memory by not explicitly storing the prefix values after the separator, as
it isn't necessary. The implementation below does not do this.

CODE:

// Uses prefix_function from the previous section.
vector<int> kmp_algo(string text, string s) {
    int n = s.length(), m = text.length();
    string str = s + "#" + text;  // '#' must occur in neither s nor text
    vector<int> pi = prefix_function(str), ans;
    for (int i = n + 1; i <= n + m; i++) { /* n + 1 is where the text starts */
        if (pi[i] == n) {
            ans.push_back(i - 2 * n); /* i - (n - 1) - (n + 1) */
        }
    }
    return ans;
}
2. Counting the number of different substrings in a string
3. Compressing a string
4. Counting the number of occurrences of each prefix
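For application 1, the O(|s|)-memory variant mentioned in the KMP discussion can be sketched as follows. The function name `kmp_low_memory` is an assumption. It stores the prefix function of the pattern only and keeps a single running value j while scanning the text, which plays the role of π at the current position of s+#+t.

```cpp
#include <string>
#include <vector>

// KMP search storing only the prefix function of the pattern s.
std::vector<int> kmp_low_memory(const std::string& text, const std::string& s) {
    int n = (int)s.size();
    std::vector<int> ans;
    if (n == 0) return ans;

    // Prefix function of the pattern alone.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }

    // Scan the text; j is the prefix-function value at the current position.
    int j = 0;
    for (int i = 0; i < (int)text.size(); i++) {
        // j == n means a full match ended just before i; it cannot be
        // extended (the separator would sit there), so fall back first.
        while (j > 0 && (j == n || text[i] != s[j])) j = pi[j - 1];
        if (text[i] == s[j]) j++;
        if (j == n) ans.push_back(i - n + 1);  // match starts at i - n + 1
    }
    return ans;
}
```

For text "ababa" and pattern "aba" this reports the starting positions 0 and 2, the same positions that the i−2n formula yields for the combined string.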

Practice Problems:
http://www.spoj.com/problems/NAJPF/

http://codeforces.com/problemset/problem/271/D

https://codeforces.com/problemset/problem/835/D

http://codeforces.com/contest/808/problem/G

http://www.spoj.com/problems/SUFEQPRE/
