String Matching and Hashing
1) String Hashing
● The brute-force way to compare two strings is to compare them letter by letter,
which takes O(min(n1, n2)) time, where n1 and n2 are the lengths of the
two strings.
● The idea behind string hashing is the following: we convert each string into an
integer and compare those integers instead of the strings. Comparing two strings is
then an O(1) operation. For the conversion we need a so-called hash function, which
maps a string to its hash value.
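● A hash value of a string is a number that is calculated from the characters of the
string. If two strings are the same, their hash values are also the same, which
makes it possible to compare strings based on their hash values. The converse does
not always hold: two different strings can share a hash value (a collision). The
function used to calculate the hash value of a string is called a hash function.
A common choice, and the one used in the code below, is the polynomial hash
$$\mathrm{hash}(s) = \left( s[0] + s[1] \cdot p + s[2] \cdot p^2 + \dots + s[n-1] \cdot p^{n-1} \right) \bmod m = \sum_{k=0}^{n-1} s[k] \cdot p^k \bmod m$$
where n is the length of the string, m and p are some chosen, positive numbers, and
s[k] stands for the numeric value assigned to the k-th character.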
● It is reasonable to make p a prime number roughly equal to the number of
characters in the input alphabet. For example, if the input is composed of only
lowercase letters of the English alphabet, p=31 is a good choice.
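CODE
long long compute_hash(string const& s) {
    const int p = 31;        // prime close to the alphabet size (26 lowercase letters)
    const int m = 1e9 + 9;   // large prime modulus
    long long hash_value = 0;
    long long p_pow = 1;     // p^i mod m for the current position i
    for (char c : s) {
        // map 'a'..'z' to 1..26 so that "a", "aa", ... do not all hash to 0
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}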
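As a small illustration of comparing strings through their hash values, the sketch below repeats the hash function and adds a helper that counts distinct strings by looking only at hashes. The helper name `count_distinct_by_hash` is hypothetical, the input is assumed to consist of lowercase English letters, and the count can be too low if two different strings happen to collide.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Polynomial hash as described above; assumes characters 'a'..'z'.
long long compute_hash(const std::string& s) {
    const int p = 31;
    const int m = 1e9 + 9;
    long long hash_value = 0, p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = (p_pow * p) % m;
    }
    return hash_value;
}

// Hypothetical helper: counts distinct strings by comparing hash values only.
// A collision between two different strings would make the result too small.
int count_distinct_by_hash(const std::vector<std::string>& words) {
    std::set<long long> seen;
    for (const std::string& w : words)
        seen.insert(compute_hash(w));
    return (int)seen.size();
}
```

For example, "ab" hashes to 1 + 2·31 = 63 and "ba" to 2 + 1·31 = 33, so the list {"ab", "ba", "ab"} contains two distinct hash values.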
Applications of hashing
1. Calculating the hash value of any substring in O(1) after preprocessing
The idea is to construct an array hash such that hash[k] contains the hash value
of the prefix s[0…k]. By definition,
$$\mathrm{hash}(s[i \dots j]) = \left( s[i] + s[i+1] \cdot p + s[i+2] \cdot p^2 + \dots + s[j] \cdot p^{j-i} \right) \bmod m = \sum_{k=i}^{j} s[k] \cdot p^{k-i} \bmod m$$
Multiplying by p^i gives
$$\mathrm{hash}(s[i \dots j]) \cdot p^i = \sum_{k=i}^{j} s[k] \cdot p^k \bmod m = \left( \mathrm{hash}(s[0 \dots j]) - \mathrm{hash}(s[0 \dots i-1]) \right) \bmod m$$
(with hash(s[0…−1]) taken as 0). So once the hash of every prefix is known, the
hash of any substring follows from this formula. To isolate hash(s[i…j]) we must
be able to divide hash(s[0…j])−hash(s[0…i−1]) by p^i. Therefore we need the
modular multiplicative inverse of p^i modulo m and multiply by it. Precomputing
all the inverses allows computing the hash of any substring in O(1) time.
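When the goal is only to compare two substrings, the division can be avoided: multiply each shifted hash by the other substring's power of p so that both sides carry the same factor. The sketch below follows that idea under a few assumptions: lowercase input, p = 31 and m = 10^9 + 9, and a prefix array shifted by one position (h[k] covers the first k characters) so that the empty prefix has a slot. The names `PrefixHasher`, `shifted_hash` and `equal_substrings` are illustrative, not taken from the notes.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct PrefixHasher {
    static constexpr long long P = 31;
    static constexpr long long M = 1000000009LL;
    std::vector<long long> h;      // h[k] = hash of the first k characters
    std::vector<long long> p_pow;  // p_pow[k] = P^k mod M

    explicit PrefixHasher(const std::string& s)
        : h(s.size() + 1, 0), p_pow(s.size() + 1, 1) {
        for (size_t k = 0; k < s.size(); k++) {
            h[k + 1] = (h[k] + (s[k] - 'a' + 1) * p_pow[k]) % M;
            p_pow[k + 1] = p_pow[k] * P % M;
        }
    }

    // Returns hash(s[i..j]) * P^i mod M (inclusive bounds) in O(1).
    long long shifted_hash(int i, int j) const {
        assert(0 <= i && i <= j && j + 1 < (int)h.size());
        return (h[j + 1] - h[i] + M) % M;
    }

    // Compares s[i1..j1] with s[i2..j2] by hash. Both sides are scaled to the
    // common factor P^(i1 + i2), so no modular inverse is needed.
    bool equal_substrings(int i1, int j1, int i2, int j2) const {
        if (j1 - i1 != j2 - i2) return false;
        return shifted_hash(i1, j1) * p_pow[i2] % M ==
               shifted_hash(i2, j2) * p_pow[i1] % M;
    }
};
```

With `PrefixHasher ph("abcabc")`, the call `ph.equal_substrings(0, 2, 3, 5)` compares the two copies of "abc" and reports them equal, while `ph.equal_substrings(0, 2, 1, 3)` compares "abc" with "bca" and reports them different.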
Algorithm (Rabin-Karp search for a pattern s in a text t): Calculate the hash value
of s and the hash values of all prefixes of t. Then compare every substring of t of
length |s| with s by the method described above; each comparison takes O(1).
Computing the hash of the pattern takes O(|s|), and computing the prefix hashes of
the text and comparing each substring of length |s| with the pattern takes O(|t|),
hence the total complexity is O(|s|+|t|).
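CODE:
vector<int> rabin_karp(string const& s, string const& t) {
    const int p = 31;
    const int m = 1e9 + 9;
    int S = s.size(), T = t.size();
    // p_pow[i] = p^i mod m
    vector<long long> p_pow(max(S, T) + 1);
    p_pow[0] = 1;
    for (int i = 1; i < (int)p_pow.size(); i++)
        p_pow[i] = (p_pow[i-1] * p) % m;
    // h[k] = hash of the first k characters of t; h_s = hash of the pattern
    vector<long long> h(T + 1, 0);
    for (int i = 0; i < T; i++)
        h[i+1] = (h[i] + (t[i] - 'a' + 1) * p_pow[i]) % m;
    long long h_s = 0;
    for (int i = 0; i < S; i++)
        h_s = (h_s + (s[i] - 'a' + 1) * p_pow[i]) % m;
    vector<int> occurrences;
    for (int i = 0; i + S - 1 < T; i++) {
        long long cur_h = (h[i+S] + m - h[i]) % m;  // hash(t[i..i+S-1]) * p^i
        if (cur_h == h_s * p_pow[i] % m)            // scale h_s instead of dividing
            occurrences.push_back(i);
    }
    return occurrences;
}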
NOTE: Hashing is not guaranteed to give the correct answer, because two completely
different strings might have the same hash (the hashes collide). A hashing solution
with fixed, known parameters can be hacked with inputs built to collide. So it is
advisable to minimise the use of hashing and prefer the other algorithms discussed
later wherever you can.
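One common way to make accidental collisions less likely, which these notes do not cover, is to hash each string twice with different parameters and compare the pairs. The sketch below assumes lowercase input; the second parameter pair (p = 37, m = 10^9 + 7) and the names `poly_hash` and `double_hash` are choices made for this illustration. Fixed parameters can still be attacked, so this lowers the risk rather than removing it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Polynomial hash with caller-supplied base p and modulus m.
long long poly_hash(const std::string& s, long long p, long long m) {
    long long hash_value = 0, p_pow = 1;
    for (char c : s) {
        hash_value = (hash_value + (c - 'a' + 1) * p_pow) % m;
        p_pow = p_pow * p % m;
    }
    return hash_value;
}

// Two independent hashes; two strings are treated as equal only if both match.
std::pair<long long, long long> double_hash(const std::string& s) {
    return {poly_hash(s, 31, 1000000009LL), poly_hash(s, 37, 1000000007LL)};
}
```

For "ab" the pair is (1 + 2·31, 1 + 2·37) = (63, 75).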
2) Prefix Function
Introduction
You are given a string s of length n. The prefix function for this string is defined
as an array π of length n, where π[i] is the length of the longest proper prefix of
the substring s[0…i] which is also a suffix of this substring. A proper prefix of a
string is a prefix that is not equal to the string itself. By definition, π[0]=0
because a string of length 1 has no non-empty proper prefix.
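The trivial algorithm follows the definition directly: iterate over the string and
compute the prefix value for each index. For each of the n indices i, try every
candidate length k (O(n) candidates) and check whether the prefix of length k equals
the suffix of s[0…i] of length k. Each comparison takes O(n) time, so the total
complexity is O(n^3).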
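vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 0; i < n; i++)
        for (int k = 0; k <= i; k++)  // k <= i keeps the prefix proper
            if (s.substr(0, k) == s.substr(i-k+1, k))
                pi[i] = k;            // the last (largest) matching k wins
    return pi;
}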
Claim: the prefix function can increase by at most one from one position to the
next, i.e. π[i+1] ≤ π[i]+1.
Proof: Indeed, otherwise, if π[i+1]>π[i]+1, then we can take the suffix ending in
position i+1 with the length π[i+1] and remove the last character from it. We end
up with a suffix ending in position i with the length π[i+1]−1, which is better than
π[i], i.e. we get a contradiction.
The above fact allows us to reduce the complexity of the algorithm to O(n^2). In one
step the prefix function can grow by at most one, so over the whole string it grows
by at most n in total, and therefore it can also decrease by at most n in total.
Hence, if for each position i we try candidate lengths starting from π[i−1]+1 and
moving downward, we only have to perform O(n) string comparisons, each costing O(n),
and reach the complexity O(n^2).
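A minimal sketch of this O(n^2) variant is given below, assuming the candidates for position i are tried from π[i−1]+1 downward. The function name `prefix_function_quadratic` is made up for this illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> prefix_function_quadratic(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        // pi[i] <= pi[i-1] + 1, so start at that bound and move down.
        for (int k = pi[i - 1] + 1; k > 0; k--) {
            // Compare the prefix s[0..k-1] with the suffix s[i-k+1..i].
            if (s.compare(0, k, s, i - k + 1, k) == 0) {
                pi[i] = k;
                break;
            }
        }
    }
    return pi;
}
```

On "abcabcd" this returns [0, 0, 0, 1, 2, 3, 0], and on "aabaaab" it returns [0, 1, 0, 1, 2, 2, 3].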
● Let j = π[i]. If s[i+1] = s[j], the prefix/suffix of length j can be extended by
one character, so π[i+1] = j+1.
● Otherwise, we need to find the next-longest possible j that still works as an equal
prefix/suffix for i, that is, the next-longest suffix length that we could “extend” from i.
● The only question left is how to find these lengths j efficiently.
We are looking for the largest k<j such that k is also a valid prefix/suffix length
for s[0…i]. This has to be the value of π[j−1], which we already calculated earlier.
Proof: Take j = π[i] = 4 as an example, and let string1 = s[0…3] and
string2 = s[i−3…i]. By the definition of the prefix function, string1 = string2.
Now k is the largest value <j such that the suffix of string2 of length k equals
the prefix of the whole string of length k (which is also a prefix of string1,
since k < length of string1). The suffix of string2 of length k is also the suffix
of string1 of length k (since both strings are equal). So we need the suffix of
string1 of length k to equal the prefix of string1 of length k, with k as large as
possible, and that is exactly the value π[3], i.e. π[j−1]. So, k = π[j−1].
● This means the algorithm actually comes out to be really simple: set j to π[i], then
while j>0 and the current j doesn’t work, keep setting j to π[j−1]. Notice that we
still have at most O(n) string comparisons, but each one only compares two
characters, so it’s now O(1) per comparison. That means we’ve done it - we can
compute the whole prefix function in O(n)!
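CODE
vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 1; i < n; i++) {
        int j = pi[i-1];                   // longest candidate to extend
        while (j > 0 && s[i] != s[j])
            j = pi[j-1];                   // fall back to the next-longest one
        if (s[i] == s[j])
            j++;
        pi[i] = j;
    }
    return pi;
}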
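To make the definition concrete, here is a self-contained copy of the linear-time routine that can be tried on small inputs. The two sample strings in the note that follows, "abcabcd" and "aabaaab", are chosen for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> prefix_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        // Walk down the chain of shorter prefix/suffix lengths until one
        // can be extended by s[i] or the chain is exhausted.
        while (j > 0 && s[i] != s[j])
            j = pi[j - 1];
        if (s[i] == s[j])
            j++;
        pi[i] = j;
    }
    return pi;
}
```

For "abcabcd" the result is [0, 0, 0, 1, 2, 3, 0]: the value 3 at index 5 says that "abc" is both a prefix and a suffix of "abcabc". For "aabaaab" the result is [0, 1, 0, 1, 2, 2, 3].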
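Applications of the prefix function
1. Searching for a pattern in a text (Knuth-Morris-Pratt)
To find all occurrences of a pattern s of length n in a text t, build the string
s + # + t, where # is a separator that occurs in neither s nor t, and compute its
prefix function. Every position i with π[i] = n marks the end of an occurrence of s,
which starts at index i − 2n of t.
In total, this is O(∣s∣+∣t∣) in both time and memory. However, you can optimize it
to O(∣s∣) memory by not explicitly storing the prefix values after the separator, as
they are not needed. The implementation below does not do this.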
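CODE:
// Sketch: the function name and signature are assumed (the listing was truncated).
vector<int> find_occurrences(string const& s, string const& t) {
    int n = (int)s.size();
    vector<int> ans, pi = prefix_function(s + '#' + t);  // '#' must not occur in s or t
    for (int i = n + 1; i < (int)pi.size(); i++)
        if (pi[i] == n)
            ans.push_back(i - 2 * n); /* i - (n - 1) - (n + 1) */
    return ans;
}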
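The O(∣s∣)-memory variant mentioned above can be sketched as follows: only the prefix function of the pattern is stored, and the value for the text is carried in a single variable while scanning. This is one possible way to do it, and the function name `kmp_search_low_memory` is made up for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the start indices of all occurrences of pattern s in text t.
std::vector<int> kmp_search_low_memory(const std::string& s, const std::string& t) {
    int n = (int)s.size();
    std::vector<int> ans;
    if (n == 0) return ans;  // an empty pattern is not handled in this sketch

    // Prefix function of the pattern only: O(|s|) memory.
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }

    // j plays the role of the prefix-function value at the current text position.
    int j = 0;
    for (int i = 0; i < (int)t.size(); i++) {
        while (j > 0 && t[i] != s[j]) j = pi[j - 1];
        if (t[i] == s[j]) j++;
        if (j == n) {
            ans.push_back(i - n + 1);  // occurrence ends at i
            j = pi[n - 1];             // same fallback a separator would force
        }
    }
    return ans;
}
```

Searching for "aba" in "ababa" yields the start positions 0 and 2, and searching for "aa" in "aaaa" yields 0, 1 and 2, so overlapping occurrences are reported as well.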
2. The number of different substrings in a string
3. Compressing a string
4. Counting the number of occurrences of each prefix
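As an illustration of the compression item in the list above, the sketch below uses a standard property of the prefix function: with k = n − π[n−1], the string consists of repeated copies of its first k characters exactly when k divides n, and otherwise it cannot be compressed this way. The function name `shortest_period` is hypothetical, and the block repeats the prefix function so that it is self-contained.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<int> prefix_function(const std::string& s) {
    int n = (int)s.size();
    std::vector<int> pi(n, 0);
    for (int i = 1; i < n; i++) {
        int j = pi[i - 1];
        while (j > 0 && s[i] != s[j]) j = pi[j - 1];
        if (s[i] == s[j]) j++;
        pi[i] = j;
    }
    return pi;
}

// Length of the shortest block whose repetition gives s exactly,
// or s.size() if no shorter block exists. Returns 0 for an empty string.
int shortest_period(const std::string& s) {
    int n = (int)s.size();
    if (n == 0) return 0;
    std::vector<int> pi = prefix_function(s);
    int k = n - pi[n - 1];
    return (n % k == 0) ? k : n;
}
```

For "abcabcabc" the last prefix value is 6, so k = 3 divides 9 and the string compresses to three copies of "abc". For "abcab" we get k = 3, which does not divide 5, so the answer is the full length 5.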
Practice Problems:
http://www.spoj.com/problems/NAJPF/
http://codeforces.com/problemset/problem/271/D
https://codeforces.com/problemset/problem/835/D
http://codeforces.com/contest/808/problem/G
http://www.spoj.com/problems/SUFEQPRE/