An approximate string-matching algorithm

Jong Yong Kim and John Shawe-Taylor

Department of Computer Science, Royal Holloway and Bedford New College, University of London, Egham, Surrey, TW20 0EX, UK
$F(0,0) = 0$,
$F(0,j) = 0$,
$F(i,0) = i$,
$$F(i,j) = \min\{F(i-1,j)+1,\; F(i,j-1)+1,\; F(i-1,j-1)+d(a_i,b_j)\},$$
where $d(a_i,b_j) = 0$ if $a_i = b_j$ and $d(a_i,b_j) = 1$ otherwise.
To allow reversal errors between neighbours, the following fourth term is added to the minimization, applicable when $a_{i-1} = b_j$ and $a_i = b_{j-1}$:
$$F(i-2,j-2)+1.$$
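This is the standard dynamic-programming table with a reversal term; a minimal Python sketch of it (our own illustration, not code from the paper) follows, where F(0, j) = 0 allows a match to begin at any position of the text:

    def approx_match(pattern: str, text: str) -> int:
        """Minimum number of differences (insertions, deletions,
        replacements, neighbour reversals) between `pattern` and any
        substring of `text`, per the recurrence above."""
        m, n = len(pattern), len(text)
        F = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            F[i][0] = i                                   # F(i, 0) = i
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d = 0 if pattern[i - 1] == text[j - 1] else 1   # d(a_i, b_j)
                F[i][j] = min(F[i - 1][j] + 1,            # deletion
                              F[i][j - 1] + 1,            # insertion
                              F[i - 1][j - 1] + d)        # replacement
                if (i > 1 and j > 1 and pattern[i - 1] == text[j - 2]
                        and pattern[i - 2] == text[j - 1]):
                    F[i][j] = min(F[i][j], F[i - 2][j - 2] + 1)  # reversal
        return min(F[m])

For example, approx_match("string", "a strnig here") returns 1: a single reversal of neighbouring symbols.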
This algorithm shows that the string edit problem can be solved in O(log² n) time on a virtual SIMD hypercube of O(n³/log² n) processors.
• Parallel algorithm using suffix tree [5]
The parallel algorithm runs in two steps. First, it concatenates the text and the pattern into one string and builds the suffix tree of this string. Second, it finds all occurrences of the pattern in the text with at most k differences, using an alternative dynamic programming method together with lowest-common-ancestor queries in the suffix tree. The suffix tree is built in O(log n) time on n processors, and searching for the lowest common ancestor in the tree takes O(1) time for x parallel queries on x processors. The alternative method employs (n + k + 1) processors (one per diagonal) and computes the diagonals in at most (k + 1) rounds in parallel. Therefore, the total complexity of the algorithm is O(log n + k) to match a text of size n against a pattern of size m with k differences.
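For concreteness, here is a sequential sketch of the diagonal computation that the parallel algorithm distributes one diagonal per processor; the O(1) suffix-tree lowest-common-ancestor jumps are replaced by plain character scans, and all names are ours rather than taken from [5]:

    def k_differences(text: str, pattern: str, k: int) -> set:
        """End positions in `text` where `pattern` matches with at most
        k differences.  L[(d, e)] is the furthest pattern row reached on
        diagonal d (text index minus pattern index) using e differences;
        there are at most k + 1 rounds, one per error count."""
        n, m = len(text), len(pattern)
        L, ends = {}, set()
        for e in range(k + 1):
            for d in range(-e, n + 1):
                row = 0
                if e > 0:
                    row = max(L.get((d, e - 1), -1) + 1,      # substitution
                              L.get((d + 1, e - 1), -1) + 1,  # deletion
                              L.get((d - 1, e - 1), -1),      # insertion
                              0)
                row = min(row, m)
                # extend the match along the diagonal; the parallel
                # version does this in O(1) by an LCA query
                while row < m and 0 <= row + d < n and pattern[row] == text[row + d]:
                    row += 1
                L[(d, e)] = row
                if row == m and 0 <= d + m <= n:
                    ends.add(d + m)        # match ends at text[d + m - 1]
        return ends

For example, k_differences("abcdefg", "bXd", 1) reports {4}: "bcd" is matched with one substitution.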
b, f, p, v → 1
c, g, j, k, q, s, x, z → 2, etc.
- if two letters with the same code were adjacent in the original name, omit all but the first;
- take the first four characters, reading from the left, of the above result.
An alternative to the above method: each letter is given a number from 3 to 9 as an information weight (reflecting its relative frequency, which defaults to a typical English frequency distribution), e.g.

a, e, i, n, o, s → 3.
Based on these weights, if a letter is present in both a pattern and a text string, its weight is added to the likeness value; each letter of the alphabet contributes to the likeness value at most once. This method would be most effective for misspelled names that sound phonetically like the correct spelling, under the assumption that all keys are distinct.
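A minimal sketch of this likeness measure follows; only the weight 3 group is specified above, so the remaining bands below are illustrative placeholders, not the paper's table:

    # The weight 3 group is from the text; the other bands are made up here.
    WEIGHTS = {c: 3 for c in "aeinos"}
    WEIGHTS.update({c: 5 for c in "dhlrtu"})    # hypothetical middle band
    WEIGHTS.update({c: 9 for c in "jqxz"})      # hypothetical rare letters
    WEIGHTS = {c: WEIGHTS.get(c, 7) for c in "abcdefghijklmnopqrstuvwxyz"}

    def likeness(pattern: str, text: str) -> int:
        """Each letter present in both strings adds its weight once."""
        shared = set(pattern.lower()) & set(text.lower())
        return sum(WEIGHTS[c] for c in shared if c in WEIGHTS)

For instance, likeness("smith", "smyth") scores the shared letters s, m, t and h once each, however often they occur.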
• Fast approximate string-matching [7]
This method is based on a binary n-gram table used to reduce the search space, prefiltering candidates by a given threshold and, finally, a similarity measure based on the Levenshtein metric to give the set of matched strings within the threshold. It is very similar to our method in its overall matching procedure, except for the data structures. The table is drastically reduced by overloading (superimposed coding), with strings packed horizontally and n-grams vertically. In this way a significant saving in space is made, which in turn reduces elapsed time by reducing the I/O load. However, no theoretical justification is given; our analysis indicates why this is such an effective approach.
• Positional (nonpositional) binary n-gram statistics [8, 12]
- Construction of bit matrices: For each dictionary word in turn, any n characters can be extracted to form an n-gram. For each n-gram $x = x_0 x_1 \cdots x_{n-1}$ (characters coded 0-25), a "1" is entered in a bit matrix at the address computed by
$$26^n X + 26^{n-1} x_0 + \cdots + x_{n-1}, \qquad (*)$$
where X = 0 for a nonpositional n-gram and is, otherwise, the sequence number of the n-gram within the word. By a random superimposed coding, the storage occupied by the bit matrix can be reduced.
- Error detection: The method tests whether a given word is not in the dictionary by using the stored bit matrix: for each of the n-grams of the word, use (*) to address one bit. If any such bit is 0, then the given input word cannot be in the dictionary (a sketch of this step appears after this survey item).
- Error correction: Let us assume that R of the n-grams detect an error; for the r-th such n-gram $N_{i,\dots,j}$, construct the subset $\beta_r = \{i, \dots, j\}$ of the positions it spans. This implies that an error must have occurred at a position from i to j, or at more than one position. If only a single symbol is in error, then there must be some position that occurs in every $\beta_r$; thus, we need only the intersection of these subsets, $\beta = \bigcap_{r=1}^{R} \beta_r$. If n errors occurred, they can be determined by enumerating the sets of n positions that could account for the errors and checking whether these positions cover all of the $\beta_r$, $r = 1, \dots, R$.
This method is limited to replacement errors only, such as might occur in an optical character recognition (OCR) system. If there are errors in more than one position, it is difficult to fix the error positions. On the other hand, the computational time of the method is independent of the word and dictionary size. The article reports the error detection and correction rates and the density of the binary matrix on various sets of positional (nonpositional) n-grams. In particular, positional trigrams show how the error, reject, and correction rates relate to the size of the word set.
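For concreteness, here is a minimal sketch of the bit-matrix construction via (*) and the detection test, restricted to adjacent (sliding-window) positional n-grams over a lowercase alphabet; one byte is stored per bit for clarity, and the superimposed coding used to compress the matrix is omitted. All names are ours, not from [8] or [12].

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"
    CODE = {c: i for i, c in enumerate(ALPHABET)}

    def address(ngram: str, X: int) -> int:
        """Address (*): 26^n * X + 26^(n-1) * x_0 + ... + x_{n-1};
        X = 0 for nonpositional n-grams, else the sequence number."""
        a = X
        for c in ngram:
            a = a * 26 + CODE[c]    # Horner evaluation of (*)
        return a

    class NGramBitMatrix:
        def __init__(self, n: int, max_position: int):
            self.n = n
            self.bits = bytearray((max_position + 1) * 26 ** n)

        def add_word(self, word: str) -> None:
            for pos in range(len(word) - self.n + 1):
                self.bits[address(word[pos:pos + self.n], pos + 1)] = 1

        def may_contain(self, word: str) -> bool:
            """If any addressed bit is 0, the word is not in the dictionary."""
            return all(self.bits[address(word[pos:pos + self.n], pos + 1)]
                       for pos in range(len(word) - self.n + 1))

Like any filter of this kind, the test can accept a word that is not in the dictionary (a false positive) but never rejects one that is.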
The purpose of this paper is to introduce an approximate string-matching algorithm adapted from [7] above and to give an expected complexity analysis for the algorithm.
3. Concept of an n-gram
(Figure: a string of N characters and its overlapping n-grams, numbered n-gram 1 through n-gram N - n + 1.)
The other assumption which will be made about the background language is that the probabilities $p_i$ are bounded by some constant K times 1/P(n), independently of n. This says that, as we choose larger and larger n-grams, no particular string or set of strings becomes excessively dominant.
4. Data structure
In the algorithm, we build two kinds of data structure, recording the sections of words and the n-grams they contain. The first is a division of the dictionary into sections. Each section will contain the same number l of characters, though this number is not specified at this stage. There are M sections and N = Ml characters in all. If the boundaries between sections are not clearly defined it may be necessary to overlap sections, but in order to simplify the analysis this will not be considered.
The second structure is the “n-gram tree”. This is a trie whose nodes consist of an
array of pointers indexed by the alphabet. To find the leaf node corresponding to
a particular n-gram, we simply follow the pointers for each of the characters of the
n-gram in turn. The leaf node will itself contain a pointer to the list of sections for the
group containing the given n-gram.
Given values for n and l, we can build the tree by simply passing through the dictionary once. At each stage we will be processing at most n n-grams, and we will need pointers into the trie for each of them. As the next character is taken, each of the pointers must move down the tree by using the appropriate pointer from its current node. If necessary, a new node will be created. The work taken will be constant for each individual pointer, and so will be proportional to n overall. When an n-gram is completed we must install the number of the current section into the list for the n-gram's group. We will assume that we can determine the group of an n-gram when its leaf node is first created, using a method which takes O(n) time. Note that, by working through the file and always inserting the section numbers at the beginning of the list, we ensure that the numbers are in reverse order without duplication.
The overall time to perform the construction of the trie is, therefore, O(Nn), where
N is the number of characters in the dictionary.
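A minimal sketch of this construction, assuming the simplest partitioning (each n-gram is its own group) and assigning each n-gram to the section in which it starts; the paper leaves both choices open. Scanning forwards and appending gives each list in ascending order without duplication (the paper prepends, obtaining reverse order). Names are ours.

    def build_ngram_tree(dictionary: str, n: int, l: int) -> dict:
        """One pass over the dictionary text: O(Nn) overall.
        Leaves (key None) hold the list of section numbers; section s
        covers characters [s*l, (s+1)*l)."""
        root: dict = {}
        for start in range(len(dictionary) - n + 1):
            section = start // l
            node = root
            for ch in dictionary[start:start + n]:
                node = node.setdefault(ch, {})      # create nodes as needed
            sections = node.setdefault(None, [])    # leaf list of sections
            if not sections or sections[-1] != section:
                sections.append(section)            # ordered, no duplicates
        return root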
In order to look up a pattern string, we again follow the paths through the tree corresponding to its n-grams, and the lists of sections corresponding to these tree nodes are merged, adding up the number of occurrences for each section containing one of the n-grams.
Any section which contains more of the n-grams than a user-defined threshold is
passed to the second stage for exact matching. Since the set of n-grams of a string
corrupted through insertion, deletion or reversal errors is slightly altered, the algo-
rithm will include searching sections containing only a proportion of the full set of
n-grams. The method ensures, however, that any section containing the string or
a slightly corrupted version will necessarily be searched, so that no string that is
a sufficiently good match will be overlooked. It is expected that the number of sections
to be searched will be, typically, quite small. The complexity analysis estimates this
number of sections.
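A sketch of the first two phases of the look-up against the trie built above; the threshold test and all names are ours:

    from collections import Counter

    def candidate_sections(tree: dict, pattern: str, n: int,
                           threshold: float) -> list:
        """Walk the trie for each n-gram of the pattern, collate the
        section lists, and keep the sections occurring in more than the
        given fraction of the lists (the netting step)."""
        ngrams = [pattern[i:i + n] for i in range(len(pattern) - n + 1)]
        counts = Counter()
        for g in ngrams:
            node = tree
            for ch in g:
                node = node.get(ch)
                if node is None:
                    break                 # n-gram absent from dictionary
            else:
                counts.update(node.get(None, []))
        return [s for s, c in counts.items() if c > threshold * len(ngrams)]

Each surviving section would then be scanned with a standard approximate matcher, such as the dynamic-programming recurrence sketched earlier, at cost proportional to m times the section length.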
In summary, the overall algorithm is divided into three main parts as follows.
• Searching for the lists of sections which contain any of the n-grams contained in the sought string
The n-gram tree is accessed to find the needed lists, which are kept separate so that
the number of n-grams which the individual sections contain can be calculated during
the next phase.
• Netting sections to remove sections under the threshold
Here the lists of sections corresponding to the individual n-grams are collated.
A new list of section numbers is created, consisting of those sections which occur in more than a given fraction of the original lists. The key observation concerning the new list is
that any section which contains the string or a slightly corrupted version of it must be
contained in this list. This is because such a section will contain most of the sought
n-grams and so will occur in more than the required fraction of the individual n-gram
lists.
• Approximate string-matching between the pattern string and the text strings in the sections
Each of the sections found at the previous stage has to be examined to see if the
string or some slight corruption of it occurs in it. Clearly, it is possible that sections
could have all of the sought n-grams and yet not contain the required string. It is,
therefore, necessary to check for the string using some standard approximate string-
matching technique. In the complexity analysis this is assumed to take time propor-
tional to the product of the length of the sought string and the length of the section.
7. Complexity analysis
We have already assessed the complexity of building the trie as O(Nn) in the
previous sections. This clearly depends on the choice of n. We begin this section by
considering the expected complexity of the look-up procedure. This analysis will
indicate an appropriate choice for n, hence indicating the actual complexity of
building the trie.
Each section of the dictionary has (l - n + 1) n-grams. The n-grams will be partitioned into sets of related strings, which could have arisen from each other as a result of typographical or other errors. The choice of these partitions will be crucial to the error-correcting performance of the algorithm, but does not concern the complexity analysis in so far as the conditions outlined below are satisfied.
As described above, let there be P(n) n-gram partitions, $g_1, \dots, g_{P(n)}$, with corresponding probabilities (relative frequencies) $p_1, \dots, p_{P(n)}$. We denote this distribution by D. We make the following further assumptions about these probabilities:
• $p_i \le K/P(n)$ for some constant K.
• P(n) grows exponentially with n, i.e. $P(n) \ge a^n$ for some a > 1.
These assumptions relate to the choice of partitioning and the properties of the
underlying language. In the case of natural language the second can be intuitively
justified by considering the L sentences of a particular length k. To make sentences of
length 2k, we can combine any two of the L sentences of length k. Hence, there are
more than $L^2$ sentences of length 2k. This is consistent with exponential growth in the
number of n-grams.
We let P* denote the following quantity:
$$P^* = \sum_{i=1}^{P(n)} p_i^2.$$
By the assumptions above, $P^* \le K/P(n) \le K b^n$, where $b = 1/a < 1$.
With a sought string of length m, we have (m + 1 - n) n-grams. Since for each n-gram we simply make n moves down the trie, the complexity of accessing the n-gram tree is
$$O(mn). \qquad (1)$$
Each access yields a list of sections; the expected length of the list for a randomly chosen n-gram is at most
$$lM \sum_{i=1}^{P(n)} p_i^2 = lMP^* = NP^*,$$
so collating the (m + 1 - n) lists has expected complexity
$$O(mNP^*). \qquad (2)$$
Finally, we must search any sections for which the threshold is surpassed. Let S be
the expected number of such sections. Then this will take
O(Slm). (3)
Suppose the threshold proportion is θ; then, given (m + 1 - n) n-grams, we require to estimate the probability that θ(m + 1 - n) of these occur in a section. By the above, the probability that a randomly chosen n-gram occurs in a section is P*l. Hence, the probability that more than θm' n-grams occur independently in a section which does not contain the sought string or a corruption of it is
$$\sum_{i=0}^{m'(1-\theta)} \binom{m'}{i} (lP^*)^{m'-i} \le (lP^*)^{\theta m'} \left(\frac{em'}{m'(1-\theta)}\right)^{m'(1-\theta)} = r^{m'},$$
where
$$r = (lP^*)^{\theta} \left(\frac{e}{1-\theta}\right)^{1-\theta} \quad \text{and} \quad m' = m + 1 - n.$$
Hence,
$$S \le Mr^{m'}. \qquad (4)$$
We can, therefore, estimate the expected time for searching sections using (3) and (4):
$$O(Slm) \le O(lmMr^{m'}) = O(mNr^{m'}). \qquad (5)$$
From (1), (2) and (5), the expected access time is, therefore,
$$O(m(n + NP^* + Nr^{m'})). \qquad (6)$$
This bound depends on the choices of n and l. In order to make these choices we would need to know P*, though this can be estimated by simply building the tree. For the optimal choice the number of n-gram groups should be N/log₂ N. This is because
$$\log_2 P(n) = \log_2 \frac{N}{\log_2 N} = \log_2 N - \log_2 \log_2 N \approx \log_2 N,$$
giving n ≈ log₂ N/log₂ a. This also means that P* ≤ (K log₂ N)/N.
To choose l we should also know P* and the size m of the look-up strings. However, taking θ = 1/2 we may safely choose
$$l = \frac{P^{*\,2/n}}{2eP^*} = \frac{P^{*\,2/n-1}}{2e},$$
so that
$$r = (2elP^*)^{1/2} = P^{*\,1/n} \quad \text{and} \quad r^{m'} = P^{*\,m'/n} \le P^*,$$
since we must always use look-up strings which are longer than a single n-gram (so that m' ≥ n).
Combining all of the above, we obtain the following complexity bound from (6) above:
$$O(m(n + NP^* + Nr^{m'})) \le O\left(m\left(\frac{\log_2 N}{\log_2 a} + 2NP^*\right)\right) \le O\left(m\left(\frac{\log_2 N}{\log_2 a} + 2K\log_2 N\right)\right).$$
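As a quick numerical sanity check of these choices (with made-up figures for P* and n, not data from the paper):

    import math

    # Hypothetical values: P* = 1e-4, n = 4, theta = 1/2.
    P_star, n, theta = 1e-4, 4, 0.5
    l = P_star ** (2 / n) / (2 * math.e * P_star)   # the choice of l above
    r = (l * P_star) ** theta * (math.e / (1 - theta)) ** (1 - theta)
    print(r, P_star ** (1 / n))   # both print approximately 0.1: r = P*^(1/n) < 1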
8. Conclusion
This paper has considered an algorithm which uses n-grams to estimate which
sections in a large text may contain a pattern string. These sections can then be
searched by standard approximate string-matching algorithms for the particular
pattern. This algorithm has been shown to perform very fast in practice [7], a fact
which a standard worst-case analysis would be unable to explain. This paper has
taken the novel approach of attempting an expected complexity analysis under certain
reasonable statistical assumptions about the language from which the text string has
been drawn. Using these assumptions a very impressive bound has been obtained on
the complexity, which reflects the performance observed in experiments.
A similar kind of analysis is performed on a substring search algorithm in a paper by Shawe-Taylor [11], which also uses n-gram techniques. They appear to be
a very promising approach to substring problems, though analysis of their perform-
ance will necessarily rely on statistical assumptions.
References
[l] L. Davidson, Retrieval of misspelled names in an airlines passenger record system, Comm. ACM
5 (1962) 169-171.
[2] G.R. Dowling and P. Hall, Approximate string matching, ACM Comput. Surveys 12 (4) (1980) 381-402.
[3] O.H. Ibarra, T.C. Pong and S.M. Sohn, String processing on the hypercube, IEEE Trans. Acoust. Speech Signal Process. 38 (1) (1990) 160-164.
[4] D.E. Knuth, Sorting and Searching (Addison-Wesley, Reading, MA, 1973).
[5] G.M. Landau and U. Vishkin, Fast parallel and serial approximate string-matching, J. Algorithms 10 (1989) 157-169.
[6] A.B. Michael, Automatic correction to misspelled names: a fourth generation language approach,
Comm. ACM (1987) 224-228.
[7] O. Owolabi and D.R. McGregor, Fast approximate string-matching, Software-Practice and Experience 18 (4) (1988) 387-393.
[8] E.M. Riseman and A.R. Hanson, A contextual postprocessing system for error correction using binary n-grams, IEEE Trans. Comput. 5 (1974) 480-493.
[9] D. Sankoff and J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Addison-Wesley, Reading, MA, 1983).
[10] S.N. Srihari, ed., Computer text recognition and error correction, Tutorial, IEEE Computer Society Press, 1984.
[11] J. Shawe-Taylor, Fast string matching in a stationary ergodic source, Tech. Report CSD-TR-633, RHBNC, University of London, 1990.
[12] J.R. Ullmann, A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words, Comput. J. 20 (2) (1977) 141-147.
[13] R.A. Wagner and M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168-178.