
Theoretical Computer Science 92 (1992) 107–117

Elsevier
An approximate string-matching
algorithm
Jong Yong Kim and John Shawe-Taylor
Department of Computer Science, Royal Holloway and Bedford New College, University of London,
Egham, Surrey, TW20 0EX, UK

Kim, J.Y. and J. Shawe-Taylor, An approximate string-matching algorithm, Theoretical Computer


Science 92 (1992) 107–117.

An approximate string-matching algorithm is described, based on earlier attribute-matching algo-
rithms. The algorithm involves building a trie from the text string, which takes time O(N log₂ N) for
a text string of length N. Once this data structure has been built, any number of approximate
searches can be made for pattern strings of length m. An expected complexity analysis is given for
the look-up phase of the algorithm, based on certain regularity assumptions about the background
language. The expected look-up time for each pattern is O(m log₂ N). The ideas employed in the
algorithm have been shown effective in practice before, but have not previously received any
theoretical analysis.

1. The motivation for approximate string-matching

In many everyday situations the same information can be stored or queried in


various formats (e.g. prefix (suffix) variation, spelling variation, synonyms, etc.). In
these cases, we may not be able to retrieve the information using an exact matching
algorithm even if it is held by the system. In other situations, the stored or queried
information may have been slightly distorted through a noisy communication channel
or human typing error, again causing exact matching to fail.
Even more extreme are situations where a perfect match is unlikely to occur such as
in image and signal processing, pattern recognition, speech interface, and recognizing
bird song, etc. In these cases the same information is hardly ever represented in
a unique form.
There are many related problems, such as finding the longest common subsequence
in a sequence for molecular biology, which also require string algorithms with
a tolerance for minor discrepancies. Though we do not consider these problems in this
paper it is felt that the techniques will generalise to many different string algorithms.

0304-3975/92/$05.00 © 1992 Elsevier Science Publishers B.V. All rights reserved



The same approach is also applicable to partial information retrieval in a poorly


structured information system or in document searching.

2. Review of previous methods

As approximate string-matching methods have been widely used, there is a large


and diffuse literature on the various methods [2, 10]. We divide the methods into two
groups. First we consider techniques for which a complete complexity analysis has
been made. In the second subsection we review a family of related methods which have
proved effective in practice but which lack complexity information. The purpose of
this paper is to put an algorithm from the second group on a firmer theoretical basis.

2.1. Algorithms with complexity results

• Dynamic programming method [13]


Suppose that a pattern string P is arranged along the top and a text string T down
the side of a two-dimensional array, and F(i,j) is the minimum number of differences
(the edit distance) between the first i characters of P and some substring of T ending at
position j. The function F(i,j) is calculated iteratively using the recurrence relations below.

F(0,0) = 0

F(0,j) = 0

F(i,0) = i

F(i,j) = min[ F(i−1,j) + 1, F(i,j−1) + 1, F(i−1,j−1) + d(P_i, T_j) ],

where d(P_i, T_j) = 0 if P_i = T_j, and 1 otherwise.

To allow reversal errors between neighbouring characters, a fourth term,
F(i−2, j−2) + 1 (applicable when P_{i−1} = T_j and P_i = T_{j−1}), is added to the
minimization.

The dynamic programming method takes O(mn) operations sequentially to produce
its best matches, where m is the length of the pattern and n that of the text. (A code
sketch of this recurrence is given at the end of this subsection.)
• Hypercube algorithms [3]
This work is essentially a parallelization of the dynamic programming technique.
To compute the edit distance in a matrix, it first uses a divide-and-conquer method to
decompose the problem into n² blocks on a hypercube; these blocks are then
merged pairwise to form a collection of n²/2 blocks, and so on until a single block
remains. The merging step is known as the perimeter-pairs shortest-path problem. The

algorithm shows that the string edit problem can be solved in O(log² n) time on
a virtual SIMD hypercube of O(n³/log² n) processors.
• Parallel algorithm using suffix tree [5]
The parallel algorithm runs in two steps. First, it concatenates the text and the
pattern into one string and builds the suffix tree of this string. Second, it finds all
occurrences of the pattern in the text with at most k differences, using an alternative
dynamic programming method and lowest-common-ancestor queries in the suffix
tree. The suffix tree is built in O(log n) time on n processors, and searching for the
lowest common ancestor in the tree takes O(1) time for x parallel queries on x proces-
sors. The alternative method employs (n + k + 1) processors (one per diagonal) and
computes the diagonals in at most (k + 1) parallel rounds. Therefore, the total
complexity of the algorithm is O(log n + k) to match a text of size n against a pattern
of size m with k differences.

2.2. Attribute-matching algorithms

• Two-index sequential file [4]


Make two copies of a file, one in which the keys (strings) are in normal alphabetical
order and another in which they are alphabetically ordered by their reverse spelling
(e.g. blaze comes after blue). A misspelled string will probably agree up to half or more
of its length with an entry in one of these files. This method will be sensitive to the
actual distribution of the strings and no theoretical or empirical results concerning its
effectiveness are known.
• Soundex [1, 6]
The goal is to transform a string into some code that tends to bring together all
variants of the same string. For example, in encoding surnames,
– retain the first letter of the name, dropping all the vowels;
– assign the following numbers to the remaining letters:

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2, etc.;

– if two letters with the same code were adjacent in the original name, omit all but
the first;
– take the first four characters of the result. (A code sketch of these steps is given at
the end of this list of methods.)
An alternative to the above method: each letter is given a number from 3 to 9 as an
information weight (based on its relative frequency, defaulting to the typical English
frequency distribution),

a, e, i, n, o, s → 3

d, h, l, r, u → 4, etc.

Based on these weights, if a letter is present in both a pattern and a text string, its
weight value is added to the likeness value; however, each letter of the alphabet contributes

to the likeness value at most once. This method would be most effective for misspelled
names that sound phonetically like the correct spelling, under the assumption that
all keys are distinct.
• Fast approximate string-matching [7]
This method is based on a binary n-gram table used to reduce the search
space, prefiltering candidates by a given threshold and, finally, a similarity measure
based on the Levenshtein metric to give a matched set of strings within the threshold.
It is very similar to our method in its overall matching procedure except for the data
structures. The table is drastically reduced by overloading (superimposed coding),
with strings arranged horizontally and n-grams vertically. In this way, a significant
saving in space is made, which in turn reduces elapsed times by reducing I/O load.
However, no theoretical analysis has been provided. Our analysis indicates why this
is such an effective approach.
• Positional (nonpositional) binary n-gram statistics [8, 12]
– Construction of bit matrices: For each dictionary word in turn, any n characters
can be extracted to form the n-grams. For each n-gram x = x_1 x_2 ... x_n, a “1” is entered in
a bit matrix at the address computed by

26^n X + 26^{n−1} x_1 + ... + x_n,   (*)

where X = 0 for a nonpositional n-gram and is, otherwise, the sequence number.
By random superimposed coding, the storage occupied by the bit matrix can be
reduced.
– Error detection: The method tests whether a given word is not in the dictionary
by using the stored bit matrix: for each of the n-grams of the word, use (*) to
address one bit. If any such bit is 0, then the given input word cannot be in the
dictionary.
– Error correction: Let us assume that R of the n-grams detect an error; for the r-th
such n-gram N_{i,...,j}, construct the subset β_r = {i, ..., j}. This implies that an error
must have occurred at a position between i and j, or at more than one position. If only
a single symbol is in error, then there must be some position that occurs in every
β_r, so we need only the intersection of these subsets, β = ∩_r β_r, r = 1, ..., R. For
n errors, candidates can be determined by enumerating the n-tuples of positions that
could account for n errors and checking whether they cover all the subsets β_r.
This method is limited to replacement errors only, such as might occur in an optical
character recognition (OCR) system. If there is an error in more than one position, it is
difficult to fix the error positions. On the other hand, the computational time of the method is
independent of the word and dictionary size. The article reports the error detection
and error correction rates and the density of the binary matrix on various sets of positional
(nonpositional) n-grams. In particular, the positional tri-grams show how the error, reject
and correction rates relate to the size of the word set.
The purpose of this paper is to introduce an approximate string-matching algo-
rithm adapted from [7] above and to give an expected complexity analysis for the

look-up phase of the algorithm. This is estimated to be O(m log₂ N) for a suitable


choice of parameters for text size N and pattern size m.

3. Concept of an n-gram

An n-gram is an n-character contiguous substring of a given string. Hence, for
a string w_1 w_2 ... w_N with N characters there are at most N − n + 1 distinct n-grams,
w_j w_{j+1} ... w_{j+n−1}, for j = 1, ..., N − n + 1 (Fig. 1).
Further to this traditional definition of an n-gram, it can be helpful to group
n-grams together which are likely to have arisen as a result of simple errors, such as
swapping characters. So, for example, in the case when n = 3 the tri-grams may be
identified by the set of characters they contain, completely ignoring the ordering
information. It should be mentioned that we are not fixing the value of n at this stage.
Indeed, in the analysis it becomes necessary to choose n according to the size of the
database considered, in order to obtain the optimal bounds on the complexity of the
look-up procedure.
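For illustration, here is a minimal Python sketch of n-gram extraction and of one possible grouping, identifying tri-grams by the multiset of characters they contain, as suggested above (the function names are ours):

def ngrams(s, n):
    # The substrings s[j:j+n] for j = 0, ..., len(s) - n.
    return [s[j:j + n] for j in range(len(s) - n + 1)]

def group_key(gram):
    # Identify an n-gram by the multiset of characters it contains,
    # ignoring order, so that simple swaps land in the same group.
    return "".join(sorted(gram))

print(ngrams("pattern", 3))                    # ['pat', 'att', 'tte', 'ter', 'ern']
print(group_key("tre") == group_key("ter"))    # True: a swap stays in the group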
One of the key assumptions we will make in the analysis is that the number P(n) of
n-gram groups grows exponentially in n. This will be considered in the context of
a background language, where there is assumed to be an underlying probability p_i > 0
for each n-gram group g_i, where g_1, ..., g_{P(n)} are the P(n) n-gram groups which do occur.

Fig. 1. Diagram of the n-gram structure (n = 3).



The other assumption which will be made about the background language is
that the probabilities p_i are bounded by some constant K times 1/P(n), independently
of n. This says that, as we choose larger and larger n-grams, no particular string or
set of strings becomes excessively dominant.

4. Data structure

In the algorithm, we build two kinds of data structure to store the sections of
words and the n-grams they contain. The first is a division of the dictionary into sections.
Each section will contain the same number l of characters, though this number is not
specified at this stage. There are M sections and N = Ml characters in all. If the
boundaries between sections are not clearly defined, it may be necessary to overlap
sections, but in order to simplify the analysis this will not be considered.
The second structure is the “n-gram tree”. This is a trie whose nodes consist of an
array of pointers indexed by the alphabet. To find the leaf node corresponding to
a particular n-gram, we simply follow the pointers for each of the characters of the
n-gram in turn. The leaf node will itself contain a pointer to the list of sections for the
group containing the given n-gram.
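A minimal Python sketch of the n-gram tree (the names are ours, and the alphabet is assumed to be the 26 lower-case letters):

ALPHABET_SIZE = 26

class TrieNode:
    def __init__(self):
        self.children = [None] * ALPHABET_SIZE   # one pointer per letter
        self.sections = None                     # at a leaf: list of section numbers

def find_leaf(root, ngram):
    # Follow one pointer per character of the n-gram: O(n) steps.
    node = root
    for ch in ngram:
        node = node.children[ord(ch) - ord('a')]
        if node is None:
            return None                          # n-gram absent from the text
    return node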

5. Building the trie

Given values for n and l, we can build the trie by simply passing through the
dictionary once. At each stage we will be processing at most n n-grams and we will
need pointers into the trie for each of them. As the next character is taken each of the
pointers must move down the tree by using the appropriate pointer from its current
node. If necessary, a new node will be created. The work taken will be constant for
each individual pointer, and so will be proportional to n overall. When an n-gram is
completed, we must install the number of the current section into the list for the
n-gram's group. We will assume that we can determine the group of an n-gram when its
leaf node is first created, using a method which takes O(n) time. Note that by working
through the file and always inserting the section numbers at the beginning of the list,
we ensure that the numbers are in reverse order without duplication.
The overall time to perform the construction of the trie is, therefore, O(Nn), where
N is the number of characters in the dictionary.
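The one-pass construction can be sketched as follows. For brevity, a Python dict stands in for the array-of-pointers trie, each n-gram is taken as its own group, and an n-gram is charged to the section in which it starts; appending section numbers in a forward pass yields an ascending duplicate-free list, equivalent to the reverse-order list described above:

from collections import defaultdict

def build_ngram_index(text, n, l):
    # One pass over the text: record, for each n-gram (group), the ordered
    # list of sections (blocks of l characters) in which it occurs.
    index = defaultdict(list)
    for j in range(len(text) - n + 1):
        gram = text[j:j + n]
        section = j // l                     # section holding the n-gram's start
        lst = index[gram]
        if not lst or lst[-1] != section:    # sections arrive in order: no duplicates
            lst.append(section)
    return index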

6. Searches for approximate strings

In order to look for a pattern string, we again follow the paths through the trie
corresponding to its n-grams, and the lists of sections corresponding to these tree
nodes are merged, accumulating the total number of occurrences for each section
containing one of the n-grams.

Any section which contains more of the n-grams than a user-defined threshold is
passed to the second stage for exact matching. Since the set of n-grams of a string
corrupted through insertion, deletion or reversal errors is only slightly altered, the algo-
rithm must include in its search sections containing only a proportion of the full set of
n-grams. The method ensures, however, that any section containing the string or
a slightly corrupted version will necessarily be searched, so that no string that is
a sufficiently good match will be overlooked. It is expected that the number of sections
to be searched will be, typically, quite small. The complexity analysis estimates this
number of sections.
In summary, the overall algorithm is divided into three main parts, as follows;
a code sketch of the look-up phase is given after the list.
• Searching for the lists of sections which contain any of the n-grams contained in the
sought string
The n-gram tree is accessed to find the needed lists, which are kept separate so that
the number of n-grams which the individual sections contain can be calculated during
the next phase.
• Netting sections to remove those below the threshold

Here the lists of sections corresponding to the individual n-grams are collated.
A new list of section numbers is created, consisting of those sections which occur in more than
a given fraction of the original lists. The key observation concerning the new list is
that any section which contains the string or a slightly corrupted version of it must be
contained in this list. This is because such a section will contain most of the sought
n-grams and so will occur in more than the required fraction of the individual n-gram
lists.
• Approximate string-matching between the pattern string and the text strings in the
sections
Each of the sections found at the previous stage has to be examined to see if the
string or some slight corruption of it occurs in it. Clearly, it is possible that sections
could have all of the sought n-grams and yet not contain the required string. It is,
therefore, necessary to check for the string using some standard approximate string-
matching technique. In the complexity analysis this is assumed to take time propor-
tional to the product of the length of the sought string and the length of the section.
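A sketch of the look-up phase, under the same simplifying assumptions as the construction sketch of Section 5 (a dict in place of the trie, identity grouping; the threshold θ = 0.5 and all names are illustrative):

from collections import Counter, defaultdict

def candidate_sections(index, pattern, n, threshold=0.5):
    # Parts 1 and 2: collate the section lists of the pattern's n-grams and
    # keep the sections hit by more than the threshold fraction of them.
    grams = [pattern[j:j + n] for j in range(len(pattern) - n + 1)]
    counts = Counter()
    for gram in grams:
        for section in index.get(gram, []):
            counts[section] += 1
    needed = threshold * len(grams)
    return sorted(s for s, c in counts.items() if c > needed)
    # Part 3 would run a standard O(ml) approximate matcher, such as the
    # dynamic programme of Section 2.1, over each surviving section.

# Tiny self-contained demonstration, reusing the indexing idea of Section 5:
text = "an approximate string matching algorithm using ngram indexing"
n, l = 3, 16
index = defaultdict(list)
for j in range(len(text) - n + 1):
    s = j // l
    gram = text[j:j + n]
    if not index[gram] or index[gram][-1] != s:
        index[gram].append(s)

print(candidate_sections(index, "strnig matching", n))   # -> [1]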

7. Complexity analysis

We have already assessed the complexity of building the trie as O(Nn) in the
previous sections. This clearly depends on the choice of n. We begin this section by
considering the expected complexity of the look-up procedure. This analysis will
indicate an appropriate choice for n, hence indicating the actual complexity of
building the trie.
Each section of the dictionary has (l − n + 1) n-grams. The n-grams will be par-
titioned into sets of related strings, which could have arisen from each other as a result
of typographical or other errors. The choice of these partitions will be crucial to the

error-correcting performance of the algorithm, but does not concern the complexity
analysis in so far as the conditions outlined below are satisfied.
As described above, let there be P(n) n-gram partitions, g_1, ..., g_{P(n)}, with corres-
ponding probabilities (relative frequencies) p_1, ..., p_{P(n)}. We denote this distribution by
D. We make the following further assumptions about these probabilities:
• p_i ≤ K/P(n) for some constant K;
• P(n) grows exponentially with n, i.e. P(n) ≥ α^n for some α > 1.
These assumptions relate to the choice of partitioning and the properties of the
underlying language. In the case of natural language the second can be intuitively
justified by considering the L sentences of a particular length k. To make sentences of
length 2k, we can combine any two of the L sentences of length k. Hence, there are
more than L² sentences of length 2k. This is consistent with exponential growth in the
number of n-grams.
We let P* denote the following quantity:

P* = Σ_{i=1}^{P(n)} p_i²,

where we suppress the dependency on n in the notation. Together with the assumptions
above, this implies

P* ≤ K/P(n) ≤ K b^n,

where b = 1/α < 1.
With a sought string of length m, we have (m + 1 − n) n-grams. Since for each n-gram
we simply make n moves down the trie, the expected complexity of accessing the
n-gram tree is

O(m(n + E)),   (1)

where E is the expected number of sections in which an n-gram will occur. It is


assumed that the section numbers are stored in a linked list, so that access time is
bounded by the number of entries. As the list is accessed the number of occurrences in
the individual sections can also be calculated by keeping the sections in lists indexed
by the current number of the n-grams they contain.
By the definition of probability as relative frequency, the expected number of
occurrences of this n-gram in the N = lM characters of the dictionary is lMp_i. The

number E can be bounded above by the expected number of occurrences of an n-gram


chosen randomly according to the distribution D. Hence,
E ≤ Σ_{i=1}^{P(n)} p_i (lM p_i) ≤ lM Σ_{i=1}^{P(n)} p_i² = NP*.   (2)

Finally, we must search any sections for which the threshold is surpassed. Let S be
the expected number of such sections. Then this will take

O(Slm).   (3)

Suppose the threshold proportion is θ; then, given (m + 1 − n) n-grams, we require to
estimate the probability that θ(m + 1 − n) of these occur in a section. By the above, the
probability that a randomly chosen n-gram occurs in a section is P*l. Hence, the
probability that more than θm′ n-grams occur independently in a section which does
not contain the sought string or a corruption of it is

S/M ≤ Σ_{i=θm′}^{m′} C(m′, i) (lP*)^i (1 − lP*)^{m′−i}

    ≤ (lP*)^{θm′} Σ_{i=θm′}^{m′} C(m′, i)

    = (lP*)^{θm′} Σ_{i=0}^{(1−θ)m′} C(m′, i)

    ≤ (lP*)^{θm′} ( em′ / ((1−θ)m′) )^{(1−θ)m′}

    = ( (lP*)^θ (e/(1−θ))^{1−θ} )^{m′} = r^{m′},

where C(m′, i) denotes the binomial coefficient, r = (lP*)^θ (e/(1−θ))^{1−θ}
and m′ = m + 1 − n.

Hence, the expected number of sections where the threshold is exceeded is

S ≤ M r^{m′}.   (4)

We can, therefore, estimate the expected time spent searching sections using (3) and (4):

O(Slm) ≤ O(Nmr^{m′}).   (5)



From (1), (2) and (5), the expected access time is, therefore,

O(m(n + NP* + Nr^{m′})),   (6)

where the individual parts are derived as follows:
• mn: time accessing the n-gram tree,
• mNP*: time accessing the lists of sections and collating them,
• mNr^{m′}: time for the search for the string within the sections exceeding the threshold.
We now assume a choice of θ = 0.5 and make the following choices for the size n of
the n-grams and the size l of the sections into which the text is divided:

n = (log₂ N − log₂ log₂ N)/log₂ α,

l = P*^{2/m′ − 1}/(2e).

In order to make these choices we would need to know α, though this can be estimated
by simply building the tree. For the optimal choice, the number of n-gram groups
should be N/log₂ N. This is because

log₂ P(n) = n log₂ α = log₂ N − log₂ log₂ N = log₂(N/log₂ N),

giving P(n) = N/log₂ N. This also means that P* ≤ (K log₂ N)/N.
To choose l we should also know P* and the size m of the look-up strings. However,
we may safely take

l = P*^{2/n}/(2eP*).

This is a slightly smaller size of l. In this case,

r^{m′} = P*^{m′/n} ≤ P*,

since we must always use look-up strings which are longer than a single n-gram.
Combining all of the above, we obtain the following complexity bound from (6) above:

O(m(n + NP* + Nr^{m′})) ≤ O( m( log₂ N/log₂ α + 2NP* ) )

                        ≤ O( m( log₂ N/log₂ α + 2K log₂ N ) )

                        = O(m log₂ N).
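As a purely numeric illustration of these parameter choices (the values of N, α and K below are assumptions, not measurements from any experiment):

import math

N = 10**8          # characters of text (assumed)
alpha = 4.0        # assumed growth rate: P(n) >= alpha**n
K = 2.0            # assumed constant in p_i <= K/P(n)

n = (math.log2(N) - math.log2(math.log2(N))) / math.log2(alpha)
P_star_bound = K * math.log2(N) / N          # P* <= K log2(N) / N

print(f"n-gram size n ~ {n:.1f}")            # ~10.9, so take n = 11
print(f"bound on P*   ~ {P_star_bound:.1e}") # ~5.3e-07
print(f"look-up cost m*log2(N) for m = 10 ~ {10 * math.log2(N):.0f}")  # ~266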



8. Conclusion

This paper has considered an algorithm which uses n-grams to estimate which
sections in a large text may contain a pattern string. These sections can then be
searched by standard approximate string-matching algorithms for the particular
pattern. This algorithm has been shown to be very fast in practice [7], a fact
which a standard worst-case analysis would be unable to explain. This paper has
taken the novel approach of attempting an expected complexity analysis under certain
reasonable statistical assumptions about the language from which the text string has
been drawn. Using these assumptions a very impressive bound has been obtained on
the complexity, which reflects the performance observed in experiments.
A similar kind of analysis is performed on a substring search algorithm in a paper
by Shawe-Taylor [11], which also uses n-gram techniques. These appear to be
a very promising approach to substring problems, though analysis of their perform-
ance will necessarily rely on statistical assumptions.

References

[1] L. Davidson, Retrieval of misspelled names in an airlines passenger record system, Comm. ACM 5 (1962) 169–171.
[2] G.R. Dowling and P. Hall, Approximate string matching, ACM Comput. Surveys 12 (4) (1980) 381–402.
[3] O.H. Ibarra, T.C. Pong and S.M. Sohn, String processing on the hypercube, IEEE Trans. Acoust. Speech Signal Process. 38 (1) (1990) 160–164.
[4] D.E. Knuth, Sorting and Searching (Addison-Wesley, Reading, MA, 1973).
[5] G.M. Landau and U. Vishkin, Fast parallel and serial approximate string-matching, J. Algorithms 10 (1989) 157–169.
[6] A.B. Michael, Automatic correction to misspelled names: a fourth generation language approach, Comm. ACM (1987) 224–228.
[7] O. Owolabi and D.R. McGregor, Fast approximate string-matching, Software–Practice and Experience 18 (4) (1988) 387–393.
[8] E.M. Riseman and A.R. Hanson, A contextual postprocessing system for error correction using binary n-grams, IEEE Trans. Comput. C-23 (5) (1974) 480–493.
[9] D. Sankoff and J.B. Kruskal, Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (Addison-Wesley, Reading, MA, 1983).
[10] S.N. Srihari, ed., Computer Text Recognition and Error Correction, Tutorial, IEEE Computer Society Press, 1984.
[11] J. Shawe-Taylor, Fast string matching in a stationary ergodic source, Tech. Report CSD-TR-633, Royal Holloway and Bedford New College, University of London, 1990.
[12] J.R. Ullmann, A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words, Comput. J. 20 (2) (1977) 141–147.
[13] R.A. Wagner and M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–178.
