Shang Tries For Approximate String Matching

Tries for Approximate String Matching
H. Shang and T.H. Merrett September 8, 1995

Abstract
Tries o er text searches with costs which are independent of the size of the document being searched, and so are important for large documents requiring spelling checkers), case insensitivity, and limited approximate regular secondary storage. Approximate searches, in which the search pattern di ers from the document by k substitutions, transpositions, insertions or deletions, have hitherto been carried out only at costs linear in the size of the document. We present a trie-based method whose cost is independent of document size.
H. Shang and T.H. Merrett are at the School of Computer Science, McGill University, Montreal, Quebec, Canada H3A 2A7, Email: fshang, timg@cs.mcgill.ca
100
Our experiments show that this new method signi cantly outperforms the nearest competitor for k=0 and k=1, which are arguably the most important cases. The linear cost (in k) of the other methods begins to catch up, for our small les, only at k=2. For larger les, complexity arguments indicate that tries will outperform the linear methods for larger values of k. Trie indexes combine su xes and so are compact in storage. When the text itself does not need to be stored, as in a spelling checker, we even obtain negative overhead: 50% compression. We discuss a variety of applications and extensions, including best match (for spelling checkers), case insensitivity, and limited approximate regular expression matching.
1 Introduction
The need to nd an approximate match to a string arises in many practical problems. For example, if an optical character reader interprets a \D" as an \O", an automatic checker would need to look up the resulting word, say \eoit" in a dictionary to nd that \edit" matches it up to one substitution. Or a writer may transpose two letters at the keyboard, and the intended word, 101
worst-case run preproc. time extra space ref.

naive KMP BM Shift-or Patricia
mn
2n 2n ? m + 1
13] 6, 1] 4] 10]
O(m + j j) O(m + j j) O(n log n) O(j j) O(n)
O(n) O(m)
Figure 1: Exact Match Algorithms say \sent", should be detected instead of the error, \snet". Applications occur with strings other than text: strings of DNA base pairs, strings of musical pitch and duration, strings of edge lengths and displacements in a diagram, and so on. In addition to substitutions and transpositions, as above, errors can include insertions and deletions. The approximate match problem in strings is a development of the simpler problem of exact match: given a text, Wn, of n characters from an alphabet , and a string, Pm , of m characters, m < n, nd occurrences of P in
W . Baeza-Yates 2] reviews exact match algorithms, and we summarize in

Figure 1. 102
Here, all algorithms except the naive approach require some preprocessing. The Knuth-Morris-Pratt (KMP), Boyer-Moore (BM), and Shift-or algorithms all preprocess the search string, P , to save comparisons. The BoyerMoore algorithms are sublinear in practice, and better the bigger m is, but depend on n. The Patricia method builds a trie and is truly sublinear.1 The preprocessing is on the text, not the search strings, and although substantially greater than for the linear algorithms, need be done only once for a text. Note that tries of size n can be built in RAM in time O(n), but that on secondary storage, memory di erences make it better to use an n log n method for all practical sizes of trie. So we quote that complexity. Trie-based methods are best suited for very large texts, which require secondary storage. We emphasize them in this paper, but will compare our trie-based method experimentally with the linear methods. Approximate string matching adds a parameter to the above, k: the algorithm reports a match where the string di ers from the text by not
1
The term \sublinear" in this literature has two meanings, which we distinguish as and truly sublinear.
Truly sublinear
sublinear
in n means O(f (n)) where f is a sublinear
function, e.g., log n or 1. Sublinear means truly sublinear or O(n) where the multiplicative constant is less than 1.
103
more than k changes. A change can be a replacement (or substitution), an insertion, or a deletion. It can also be a transposition, as illustrated above. Such operations were formulated by Damerau 8] and the notion of edit distances was given by Levenshtein 15]. A dynamic programming (DP) algorithm was shown by Wagner and Fischer 26] with O(mn) worst case. Ukkonen 24] improved this to O(kn) (and clearly k
m) by nding
a cuto in the DP. Chang and Lawler 7] have the same worst case, but get sublinear expected time, O((n=m)k log m)) and only O(m) space, as opposed to O(m2) or O(n) for earlier methods. This they do by building a su x tree 27, 16], which is just a \Patricia" trie (after Morrison 19]), on the pattern as a method of detecting common substrings. Kim and Shawe-Taylor 12] propose an O(m log n) algorithm with O(n) preprocessing. They generate ngrams for the text and represent them as a trie for compactness. Baeza-Yates
and Perlberg 5] propose a counting algorithm which runs in time independent of k, O(n + R), where R is bounded O(n) and is zero if all characters in Pm are distinct. Figure 2 summarizes this discussion. Agrep 28] is a package based on related ideas, which also does limited regular expression matching,
i.e., Pm is a regular expression.
(Regular expression matching and k-approximate string matching solve 104
worst-case run preproc. time extra space ref.

DP cuto su x tree n-gram
O(mn) O(kn) O(kn) O(m log n)
O(m) O(k) O(m)
O(mn) O(kn) O(m)
26] 24] 7] 12]
Figure 2: k-Approximate Match Algorithms di erent problems. The problem areas overlap | e.g., P5 = \a#a##", where
#
is a one-place wildcard, can be written as a regular expression, but is also A recent review of these techniques is in the book by Stephen 23]. Hall
a 3-approximate match | but they do not coincide.) and Dowling 11] give an early survey of approximate match techniques. The work is all directed to searches in relatively small texts, i.e., those not too large to t into RAM. For texts that require secondary storage, O(n) is far too slow, and we need O(log n) or faster methods, as with conventional les containing separate records 17]. The price we must pay is to store an index, which must be built once for the whole text (unless the text changes). If we are interested in the text as an ordered sequence of characters, we must store 105
the text as well, and the index represents an additional storage requirement. If we are interested in the text only for the substrings it contains, as in a dictionary for spelling check, then we need only store the index, and we can often achieve compression as well as retrieval speed. Tries have been used to index very large texts 10, 18] and are the only known truly sublinear way to do so. Tries are trees in which nodes are empty but have a potential subtree for each letter of the alphabet, , encoding the data (e.g., 0 and 1 for binary tries). The data is represented not in the nodes but in the path from root to leaf. Thus all strings sharing a pre x will be represented by paths branching from a common initial path, and considerable compression can be achieved.2 Substring matching just involves nding a path, and the cost is O(m + log n) plus terms in the number of resulting matches. (The log n component re ects only the number of bits required to store pointers to the text, and is unimportant.) Regular expression matching
2
Note that this compression is on the index, which may still be larger than the text.
Typically, if we index every character in the text, as we do in Section 4, the index will be ve times the size of the text. If we index only every word, the index is smaller and compression results. 18] If we do only dictionary searches, as in Section 6, there is great compression.
106
simulates the regular expression on the trie, 9] and is also fast O(logm(n) n ) where <1. This paper proposes a k-approximate match algorithm using DamerauLevenshtein DP on a text represented as a trie. The insight is that the trie representation of the text drastically shortens the DP. A m n DP table is used to match a given Pm with the text, Wn . There would have to be a new table for each su x in W (of length n; n ? 1; : : :). But the trie representation of W compresses these su xes into overlapping paths, and the corresponding column need be evaluated only once. Furthermore, the Ukkonen cuto can be used to terminate unsuccessful searches very early, as soon as the di erences exceed k. Chang and Lawler 7] showed Ukkonen's algorithm evaluated O(k) columns, which implies searching a trie down to depth O(k). If the fanout of a trie is , the trie method needs only to evaluate O(k j jk ) DP table entries. We present this method in terms of full-text retrieval, for which both the index and the text must be stored. In applications such as spelling checkers 14], the text is a dictionary, a set of words, and need not be stored separately from the index. These are special cases of what we describe. In such cases, our method o ers negative storage overhead, by virtue of the compression, 107
in addition to the very fast performance. We compare our work experimentally with agrep 28], and show that tries outperform agrep signi cantly for small k, the number of mismatches. Since
agrep complexity is linear in k, and trie search complexity is exponential in k, agrep is expected to become better than tries for large k. Our experiments
show that the breakeven occurs beyond the practically important case of
k = 1. Since the authors of agrep compare their work thoroughly with other
approximate search techniques 28], we make no other comparisons here. This paper is organized as follows. The next section introduces DamerauLevenshtein DP for approximate string matches. Section 3 brie y describes trie data structures, and gives our new algorithm for approximate search on text tries. Then we give experimental results comparing approximate trie methods with agrep. Sections 5 and 6 discuss extensions and advanced applications of our method, including the important case of dictionary checking, where we attain both speedup and compression. We conclude and discuss further possible research.
108
2 Dynamic Programming
Let Pm = p1p2:::pm and W` = w1w2:::w` be a pattern and a target string respectively. We use D(Pm ; W`) for edit distance, the minimum number of edit operations to change Pm to W`. Here, an edit operation is either to insert wj after pi, delete pi , replace pi by wj , or to transpose two adjacent symbols in Pm . We assume symbols are drawn from a nite alphabet, . Given an example P7 = exsambl and W7 = example. We have D(P7 ; W7) = 3 since changing P7 to W7 needs to: (1) delete p3 = s, (2) replace p3 = b by
w5 = p, and (3) add w7 = e after p7 = l. The edit distance, D(Pi ; Wj ), can

be recursively de ned as follows: 8 > >0 i = j =0 > > > >1 > i< 0 or j < 0 > 0 1 > > > B D(Pi ; Wj?1 ) + 1 C < B C B C D=> B C B D(Pi?1 ; Wj ) + 1 C > B C > C else > min B B C > B > B D(Pi?1 ; Wj?1) + Sij C C > B C > B C > B C > @ A > : D(Pi?2 ; Wj?2) + Rij where8i = wj = when i; j 0 (the null character), and p 8 > > > 0 pi = wj > 1 pi?1 = wj ^ pi = wj?1 < < Sij = > , Rij = > . > 1 else > 1 else : : 109
To evaluate D(Pm ; W` ), we need to invoke D four times with both subscripts decreasing by no more than two. Thus, a brute force evaluation must take O(2min(m;`)) calls. However, for D(Pm ; W`), there are only (m+1) (`+1) possible values. DP evaluates D(Pm ; W`) by storing each possible D value in a m ` table. Table 1 shows a 3 4 DP table for P2=ab and W3=bbc.
w0 = p0 p1 p2
0 1 2
w1 = b
1
w2 = b
2
w3 = c
3 =)
D(P1 ; W1) D(P1 ; W2) D(P1 ; W3) D(P2 ; W1) D(P2 ; W2) D(P2 ; W3) b b c p0 =
0 1 2 3
p1 = a 1 1 2 3 p2 = b 2 1 1 2
Table 1: Dynamic Programming Furthermore, it is not necessary to evaluate every D values (DP table entries). Ukkonen 24] proposed an algorithm to reduce the table evaluations. His algorithm works as follows: Let Cj be the maximum i such that
D(Pi ; Wj )
k for the given j (Cj =0 if no such i). Given Cj?1, compute

110
D(Pi ; Wj ) up to i Cj?1+1, and then set Cj to the largest i (0 i Cj?1 +1)

such that D(Pi ; Wj )
k. Chang 7] proved that this algorithm evaluates
O(k2) expected entries. As shown in Table 2, for P4 =adfd and W7=acdfbdf

of 5 8=40 entries, Ukkonen's algorithm evaluates only 23 entries for k=1. Ukkonen's algorithm sets D(P1 ; W0)=1, D(P2 ; W0)=2, and C0=1 at initial time. It evaluates the rst column up to row C0+1=2. Since the largest entry value of this column is at row 2, it sets C1=2. Then, it evaluates the second column up to row C1+1=3. Since the largest entry value of this column is at at row 2, it sets C2=2. Similarly, it evaluates the third column up to row C2+1=3 to get C3=2, the fourth column to get C4=3, and the fth column to get C5=0, which indicates that it is impossible to change any pre x of adfd to acdfb in less than one edit operation. Thus, we know
D(P4 ; W7)>1. We can stop the evaluation if we do not want to know the
exact value of D(P4 ; W7).
3 Trie and Approximate Search

We follow Gonnet et al. 9] in using semi-in nite strings, or sistrings. A sistring is a su x of the text starting at some position. A text consists 111
of many sistrings. If we assume sistrings start at word boundaries, the text, \echo
enfold sample enface same example
," will have six sistrings
of this kind. Figure 3 shows these sistrings and an index trie constructed over these sistrings. To make Figure 3 simpler, we truncate sistrings after the rst blank. To index full size sistrings, we simply replace leaf nodes by sistring locations in the text. To prevent a sistring being a proper su x of another, we can append either arbitrary numbers of the null symbol after the text or a unique end-of-text symbol. The index trie has many distinctive properties: When conducting a depth- rst traverse, we not only get all sistrings, but also get them in lexicographical order. When searching a string, say example, branching decisions at each node are given by each character of the string being sought. As the trie in Figure 3, we test the rst letter e to get to the left branch, and the second letter x to get to the right branch. As a result, search time is proportional only to the length of the pattern string, and independent of the text size.
112
Text: echo enfold sample enface same example Sistrings: echo enfold sample enface same example enfold sample enface same example sample enface same example enface same example same example example e Trie: c ho f n x ample m a s
a ce
o ld le
Figure 3: Text, Sistring and Index Trie The common pre xes of all sistrings are stored only once in the trie. This gives substantial data compression, and is important when indexing very large texts. Trie methods for text can be found in 10, 18, 22]. Here we describe them only brie y. When constructing a trie over a large number of and extremely long sistrings, we have to consider the representation of a huge trie on secondary storage. Tries could be represented as trees, with pointers to subtrees, 113
as proposed by Morrison 19], who rst came up with the Patricia trie for text searches. Orenstein 21] has a very compact, pointerless representation, which uses two bits per node and which he adapted for secondary storage. Merrett and Shang 18, 22] re ned this method and made it workable for Patricia tries with one bit per node. Essentially, both pointerless representations would entail sequential searches through the trie, except that the bits are partitioned into secondary storage blocks, with trie nodes and blocks each grouped into levels such that any level of nodes is either entirely on or entirely o a level of blocks. With the addition of two integers per block, the sequential search is restricted to within the blocks, which may be searched as a tree. For more details of this representation, see 22].
3.1 Two Observations

Before introducing our approximate search algorithm, we give two observations which will link the trie method with the DP technique.
Observation I
Each trie path is a pre x shared by all sistrings in the subtrie. When evaluating DP tables for these sistrings, we will have identical columns up to the 114
pre x. Therefore, these columns need to be evaluated only once. Suppose we are searching for string sane in a trie shown in Figure 3. To calculate distances to each word, we need to evaluate six tables. Table 3 shows three of them. For each table, entries of the ith column depend only on entries of the j i th column, or the rst i letters of the target word. Words sample and same have the same pre x sam, and therefore, share the table entries up to the third column. And so does the rst column of words
echo enface enfold
and example, the rst three columns of words enface
and enfold. In general, given a path of length x, all DP entries of words in the subtrie are identical up to the xth column. This observation tells us that edit distances to each indexed word (sistring in general) can be calculated by traversing the trie, and in the meantime, storing and evaluating one DP table. Sharing of common pre xes in a trie structure saves us not only index space but also search time.
Observation II
If all entries of a column are > k, no word with the same pre x can have a distance k. Therefore, we can stop searching down the subtrie. For the last table of Table 3, all entries of the second column are > 1. 115
If searching for words with k = 1 di erences, we can stop evaluating strings in the subtrie because for sure D(sane; en:::) > 1. For the same reason, after evaluating the fourth column of table sample, we nd all entries of the column are > 1, and therefore, stop the evaluation. This observation tells us that it is not necessary to evaluate every sistring in a trie. Many subtries will be bypassed. In an extreme case, the exact search, all but one of the subtries are trimmed.
3.2 Search Algorithm

The algorithm of Figure 4 shows two functions: DFSearch( TrieRoot, 1) traverses an index trie depth- rst, and EditDist( j ) evaluates the j th column of the DP table for pattern string P and target string W . For the purpose of illustration, we start and stop evaluation at the word boundary in the following explanation. Essentially, this algorithm is a trie walker with cuto s (rejects before reaching leaves). Given a node c, its root-to-c path, w1w2:::wx, is a pre x shared by all strings in SubTrie(c). If changing w1w2:::wx to any possible pre x of P costs more than k, there will be no string in SubTrie(c) with 116
T C
:array :array
?1::max; ?1::max] of
0::max] 0::max]
of integer;
integer;
/* i; 0] = 0; j ] = i + j , ?1; ] = ; ?1] = 1 */
/* variables for Ukkonen's cuto , C 0] = k */ /* pattern and target string, W 0] = P 0] = */ /* number of allowable errors */
P,W :array k
of character;
:integer;
Procedure begin if (
DFSearch( TrieNode
:Anode,
Level
:integer);
/* depth- rst trie search */
TrieNode in a leaf node) Level]

:=
then do
for W
each character in the node
/* retrieve characters one by one */ /* nd a target word */
the retrieved character;

= ' ') then
if (W
Level]
output W 1]W 2]...W j-1]; return; if (
EditDist( Level)
:=
) then
/* more than k mistakes */
return;
Level
else for
Level
+ 1;
each child node

:=
do
/* retrieve child node one by one */
ChildNode
W
117 the retrieved node; /* nd a target word */
Level]
:=
the retrieved character;

= ' ') then
if (W
Level]
output W 1]W 2]...W j-1];
k mismatches. Hence, there is no need to walk down Subtrie(c). A cuto

occurs. Each letter wj (1 j x) on the path will cause a call to EditDist(j ). We use Ukkonen's algorithm to minimize row evaluations. Suppose we have a misspelled word P =exsample and want all words with
k=1 mismatches. Figure 5 shows the index trie and some intermediate results
of the search. After evaluating D(P; ech), we nd that entries on the third column are all 2. According to observation II, no word W with the pre x
ech
can have D(P; W )
1. We reject word echo and continue traversing.
After evaluating D(P; enf), we know, once again, no word W with pre x enf can have D(P; W ) 1, and therefore, there is no need to walk down this subtrie. We cut o the subtrie. Since ech and enf share the same pre x e, we copy the rst column of ech when evaluating enf (observation I). After evaluating path 3, we nd D(P; example) = 1 and accept the word. The search stops after cutting at path 4, sa. Figure 5 shows some intermediate results of the search.
118
1
ho
3
ample
4
m
2
o ld le p
a ce
Pattern String: Depth First Search Path 1: Search Path 2: Search Path 3: Search Path 4:
exsample String ech enf example sa
k1 Distance 2 2 =1 2 Action reject cutoff accept cutoff
Figure 5: Approximate Trie Search Example
119
4 Experimental Results
We built tries for ve texts: (1) The King James' Bible retrieved from akbar.cac.washington.edu, (2) Shakespeare's complete works provided by Oxford University Press for NeXT Inc., (3) section one of UNIX manual pages from Solbourne Computer Inc., (4) C source programs selected randomly from a departmental teaching machine, and (5) randomly selected ftp le
names provided by Bunyip Information System. Sistrings start at any char-
acter except the word boundary, such as blank and tab characters. Table 4 shows the sizes of the ve texts and their index tries.
4.1 Search Time

We randomly picked up 5 substrings from each of the ve texts, and then searched for the substrings using both agrep 28] and our trie algorithm. Both elapsed time and CPU time are measured on two 25MHz NeXT machines, one with 28MB RAM and the other with 8MB RAM. Table 5 shows measured times, averaged on the ve substrings, in seconds. The testing results show that our trie search algorithm signi cantly outperforms agrep in exact match and approximate match with one error. For 120
the exact match, trie methods usually give search time proportional only to the length of the search string. Our measurements show that trie search times for exact match do not directly relate to the text size. It requires few data transfers (only one search path), and therefore, is insensitive to the RAM size. Let (k) be the average trie search depth. It is the average number of columns to be evaluated before assuring that D(P; W ) > k. It has been proven that (k) > k if k is less than the target string length, and (k) =
O(k) 24, 7]. For a complete trie, the worst case of a text trie, the trie search
algorithm can nd all substrings with k mismatches in O(k j jk ) expected time: there are j jk paths up to depth k, and each column of the DP table has k rows. The time is independent of the trie size. In fact the trie algorithm is better than the agrep for small k, but not for large k, because agrep scans text linearly but the trie grows exponentially. For our measured texts, which are relatively small, the trie search brings more data into RAM than agrep when k 2, When RAM size is larger than data size, measured CPU times are closer to the elapsed times. Since each query is tested repeatedly, most of data (text and trie) are cached in RAM, and therefore, the searches are CPU-bound. 121
However, for a smaller RAM size (or larger text data), the searches have to wait for data to be transferred from secondary storage. Since agrep scans the entire text, its search time is linearly proportional to the text size. File names are di erent from the other tested texts. File names are all pairwise distinct. Any two substrings resemble each other less, which helps
agrep to stop evaluation more quickly. This does not help the trie search
because it makes the trie shallow (toward a complete trie) and takes more time to scan the top trie levels.
5 Extensions
Our trie search algorithm can be extended in various ways. For example, spelling checkers are more likely to ask for the best matches, rather than the words with a xed number of errors. The optical character recognizers may search for words with substitutions only. When searching for telephone numbers, license numbers, postal codes, etc., users require not only penalties for certain types of edit operations, but also a combination of the exact search and the approximate search because they often remember some numbers for sure. In text searching, patterns are more often expressed in terms of regular 122
expressions. Extensions described in this section (except Section 5.5) have been discussed in 28]. We present them here using DP.
5.1 Best Match

In some applications, we do not know the exact number of errors before a search. We want strings with the minimal number of mismatches, i.e., strings with 0 k mismatches and no other string in the text having k0<k mismatches. To use our algorithm, we de ne a preset k, which is a small number but no less than the minimal distance, i.e., there exists a string, s, in the text such that D(pattern; s)
k. A simple method to set k is to let s be an
arbitrary string in the text, and then set k = D(pattern; s). A better way is to search for the pattern using deletions (or insertions, or substitutions) only. This is to traverse the trie by following the pattern string. Whenever no subtrie corresponds to a character of the pattern, we skip the character in the pattern and look for a subtrie for the next character, and so on. The number of skipped characters will be used as an initial k. During the traverse, we will have k0 = D(pattern,s) for a leaf node, where 123
s is the path from the root to the leaf node. Whenever we have k > k0, we set k = k0 and clear the strings that have been found. For best match searching, k decreases monotonically.
5.2 Weighted Costs

The distances evaluated before are assumed to have cost 1 for any edit operation. Sometimes, we may want to have a di erent cost. For example, to have substitution costs at least the same as one deletion and one insertion, or to disallow deletions completely. To make edit operations cost di erently, we need only to modify the distance function. Let I , D, S and R be the costs of an insertion, a deletion, a substitution, and a transposition respectively. We assume costs are all
> 0. To disallow an operation, say insertions, we set I = 1. As before, D(P0 ; W0) = 0 and D(Pi ; Wj ) = 1 if i or j < 0. Otherwise, we rede ne D(Pi ; Wj ) as follows:
124
0 B D(Pi ; Wj?1) + Iij B B B B D(Pi?1 ; Wj ) + Dij B D(Pi ; Wj ) = min B B B B D(Pi?1 ; Wj?1) + Sij B B B @ D(Pi?2 ; Wj?2) + Rij Here I8 = I , Dij = D, and ij 8 > > > R pi?1 = wj ^ pi = wj?1 > 0 pi = wj < < . , Rij = > Sij = > > 1 else > S else : :
1 C C C C C C C C C C C C C A
Furthermore, we may add a cost, C , for changing the case. For example, for case insensitive searches, we set C = 0, and for case sensitive searches, we set C = 1. We may even disallow case changes by setting C = 1. Let a ' b be a = b without checking the case di erence, and let a b mean that a and 8 > > 0 pi w j < , and replace: b are of the same case. Now, we de ne, Cij = > > C else : 8 > > 0 pi ' w j < Sij = Cij + > ; > S else : 8 > > R pi?1 ' wj ^ pi ' wj?1 < : Rij = Ci?1;j + Ci;j?1 + > > 1 else : The concept of changing cases can be extended even more generally. For example, when searching a white page for telephone numbers, we don't want 125
an apartment number, such as 304B, to be recognized as a telephone number, i.e., do not replace a character unless it is a digit to a digit. For the same reason, we may not want to mix letters, digits and punctuation with each other when searching for license plates, such as RMP-167, or postal codes, such as H3A here.
2A7
. For those applications, we can use above de nitions for
Sij and Rij , but give a new interpretation of C . We will not elaborate them
5.3 Combining Exact and Approximate Searches

We sometimes know in advance that only certain parts of the pattern may have errors. For example, many spelling checkers may give no suggestions for garantee. But suppose we knew the su x rantee was spelled right. In this case, we want to search part of the pattern exactly. By following agrep standards 28], we denote this pattern as ga<rantee>. Characters inside a
<>
cannot be edited using any one of the four operations. To support both exact and approximate searches for the same pattern,
we need only modify Iij , Dij , Cij , Sij and Rij . Let function j pi be a predicate that determines whether pi is a member character inside an exact match <>. 126
Let function ? pi be a predicate that tells whether pi is the last character inside a <>. The new de nitions are: 8 8 > > > 1 j pi ^ 6? pi > 1 j pi < < Iij = ; Dij = > ; > I else > D else : : 8 > > 1 j pi ^ pi 6= wj > > < Cij = > C pi 6 wj _ pi ' wj ^ pi 6= wj ; > > > 0 else : 8 > > 0 p i ' wj < Sij = Cij + ; > S else : 8 > >R P < Rij = Ci?1;j + Ci;j?1 + > ; > : 1 else where P = (pi?1 ' wj ^ pi ' wj?1) ^ 6 j pi?1 ^ 6 j pi . By above de nitions, string guarantees also matches ga<rantee> with two insertions. To disallow insertions at the end of an exact match, we introduce an anchor symbol, $ (borrowed from Unix standards). Pattern
ga<rantee>$
means that target strings must have the su x rantee. What
needs to be changed is to set ? pi false when there is a $ symbol followed pi, i.e., a pattern looks like : : :<: : :pi >$. In a similar way, we introduce another anchor symbol, ^, to prevent insertions at the beginning of an exact match. 127
For example, ^<g>a<rantee>$ means that target strings must start with the letter g and ended with the su x rantee. This time, we set j p0 true.
5.4 Approximate Regular Expression Search

The ability to match regular expressions with errors is important in practice. Regular expression matching and k-approximate string matching solve di erent problems. They may overlap but do not coincide. For example, the regular expression a#c, where # is a one-place wildcard, can be written as a 1-approximate match with substitutions and insertions on the second character only. Baeza-Yates 5] proposed an search algorithm for the full regular expression on tries. In this section, we will extend our trie algorithm to deal with regular expression operators with errors. However, the extension operators work only for single characters, i.e., there is no group operator. For example, we may search for a*b with mismatches, but not (ab)*. Searching tries for the full regular expression with approximation is an open problem.
128
5.4.1 Alternative Operator

Suppose we want to nd all postal codes, H3A or 7. First, we introduce the notation, set of alternative characters. Thus, H3A exactly; while H3A
2A4 137] 2A?
, where ? is either 1, 3,
]
(once again, borrowed from de nes a

2A 137]
Unix standard), to describe either 1, 3, or 7. Formally, operator

2A7
matches pattern H3A
matches the pattern with one mistake.
Substituting one character with a set of allowable characters can be easily achieved by rede ning the = and ' operators of Section 2 and Section 5.2 respectively. For pattern P7 =H3A
2A 137]
, we have p1 =H, p2 =3, ..., and
p7 =
137]
. We de ne p7 = wj as either 1= wj , or 3= wj , or 7= wj . In other
]
words, if pi is a set of allowable characters, pi = wj means wj matches one of the characters de ned by the of =. As syntactic sugar (Unix standards), we may denote case letters, i.e., a range of characters; wild card.
^aeiou] a-z]
operator. ' is the case insensitive version for all lower
for anything but vowels,
i.e., a complement of the listed characters; and . for all characters, i.e., the
129
5.4.2 Kleen Star

The kleen star allows its associated characters to be deleted for free, or to be replaced by more than one identical character for free. For example, ac,
abc abbc
and abbbc all match pattern ab*c exactly. a
0-9]*c
means that
an unbounded number of digits can appear between a and c. Let function pi be a predicate which says there is a Kleen star associated with the pattern character pi. To support the Kleen star operator, we need only to change Iij and Dij . Remember, pi means that we can delete pi at no cost, and insert any number of w = pi after pi at no cost. We now give the new de nition as follows: 8 > >1 > > < Iij = > Cij > > >I : 8 > >1 > > < Dij = > 0 > > >D :
6 pi ^ j pi ^ 6? pi
pi ^ p i ' w j
else
6 p i ^ j pi
pi
else
130
5.5 Counter
Our algorithm can also be extended to provide counters. Unlike a Kleen star, e.g., ab*c, which means that unbounded number of bs can appear between
a
and c, pattern ab?c says that only ac and abc match exactly. If we want
these strings abbc, abbbc, abbbbc and abbbbbc, i.e., two to ve bs between a and c, we can write the pattern as abbb?b?b?c, or abf2,5gc (Unix syntax). To support counters, we need only to modify Dij since p? means character
p
can deleted for free. Let us de ne a function ?pi which says there is
a counter symbol, ?, associated with the pattern character pi . The new 8 de nition is: > > 1 6 ?pi ^ 6 pi ^ j pi > > < Dij = > 0 ?pi _ pi : > > > D else :
6 Dictionary Search
By a dictionary, we mean a text le which contains keywords only, i.e., a set of strings that are pairwise distinguishable. For dictionary searches, we are only interested in those keywords that relate to the pattern by some measurements (in our case, the edit distance). The orders (or locations) of 131
those keywords are not important to us. For such applications, the text le can be stored entirely in a trie structure. The trie in Figure 3 is a dictionary trie. Experimental results in 22] show that dictionary trie sizes are about 50% of the le sizes for English words. In other words, we are providing not only an algorithm for both exact and approximate searches, but also a data structure for compressing the data up to 50%. Searches are done on the structure without decompression operations. Searching soundex codes 20] is an example of the dictionary search. By replacing English words with their soundex codes and storing the codes in the dictionary trie, we are able not only to search any given soundex code e ciently (exact trie search) but also to reduce the soundex code size by half. Searching an inverted le is another example of dictionary search. An inverted le is a sorted list of keywords in a text. The trie structure keeps the order of its keys. By storing keywords in the dictionary trie, we can either search for the keywords or for their location. Furthermore, our trie algorithm provides search methods for various patterns with or without mismatches.
132
7 Conclusion
Tries have been used to search for exact matches for a long time. In this paper, we have expanded trie methods to solve the k approximate string matching problem. Our approximate search algorithm nds candidate words with k di erences in a very large set of n words in O(k j jk ) expected worst time. The search time is independent of n. No other algorithm which achieves this time complexity is known. Our algorithm searches a trie depth rst with shortcuts. The smaller k is, the more subtries will be cut o . When k = 0, all irrelevant subtries are cut o , and this gives the exact string search in time proportional only to the length of the string being sought. The algorithm can also be used to search full regular expressions 3]. We have proposed a trie structure which uses two bits per node and has no pointers. Our trie structure is designed for storing very large sets of word strings on secondary storage. The trie is partitioned by pages and neighboring nodes, such as parents, children and siblings, are clustered in terms of pages. Pages are organized in a tree like structure and are searched in time logarithmic the le size. 133
Our trie method outperforms agrep, as our results show, by an order of magnitude for k=0, and by a factor of 4 for k=1. Only when k 2 does the linear worst case performance of agrep begin to beat the trie method for the moderately large documents measured.
8 Future Work
Spelling checkers based on searching minimal edit distance performs excellently for typographic errors and for some phonetic errors. For example,
exsample
to example has one di erence, but sinary to scenery has three
di erences. To deal with phonetic misspellings, we may follow Veronis's work 25] by giving weights to edit operations based on phonetic similarity, or using non-integer distances to obtain ner grained scores on both typographic and phonetic similarities. Another solution is to follow the convention which assumes no mistakes in the rst two letters, or gives higher penalty for the rst few mistakes. Excluding the rst few errors allows us to bypass many subtries near the trie root. This not only gives quicker search time, but also reduces the number of possible candidates. With a small set of candidate words, we can impose a linear phonetic check. 134
Even with one di erence, a short word, say of 2 letters, matches many English words. There are more short words than long words. This type of error is di cult to correct out of context.
Acknowledgments
This work was supported by the Canadian Networks of Centres of Excellence (NCE) through the Institute of Robotics and Intelligent Systems (IRIS) under projects B-3 and IC-2, and by the Natural Sciences and Engineering Research Council of Canada under grant NSERC OGP0004365.
References
1] A. Apostolico. The myriad virtues of su x trees. In Combinatorial
Algorithms on Words, pages 85{96. Springer-Verlay, 1985.
2] R.A. Baeza-Yates. String searching algorithms. In W.B. Frakes and R.A. Baeza-Yates, editors, Information Retrieval: Data Structures and
Algorithms, pages 219{40. Prentice-Hall, 1992.
135
3] R.A. Baeza-Yates and G.H. Gonnet. E cient text searching of regular expressions. In G. Ausiello, M. Dezani-Ciancaglini, and S.R.D. Rocca, editors, Proceedings of 16th International Colloquium on Automata, Languages and Programming, LNCS 372, pages 46{62, Stresa, Italy, July
1989. Springer-Verlag. 4] R.A. Baeza-Yates and G.H. Gonnet. A new approach to text searching.
Communications of the ACM, 35(10):74{82, 1992.
5] R.A. Baeza-Yates and C.H. Perleberg. Fast and practical approximate string matching. In G. Goos and J. Hartmanis, editors, Proceedings of
3rd Annual Symposium on Combinatorial Pattern Matching, LNCS 644,
pages 185{92, Tucson, Arizona, April 1992. Springer-Verlag. 6] R.S. Boyer and J.S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762{72, 1977.
7] W.I. Chang and E.L. Lawler. Approximate string matching in sublinearexpected time. In 31st Annual Symposium on Foundations of Computer
Science, pages 116{24, St. Louis, Missouri, October 1990. IEEE Com-
puter Society Press. 136
8] F.J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171{6, 1964. 9] G.H. Gonnet. E cient searching of text and pictures. Technical Report OED-88-02, Centre for the New OED., University of Waterloo, 1988. 10] G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. New indices for text: PAT trees and PAT arrays. In W.B. Frakes and R.A. Baeza-Yates, editors, Information Retrieval: Data Structures and Algorithms, pages 66{ 82. Prentice-Hall, 1992. 11] P.A.V. Hall and G.R. Dowling. Approximate string matching. Computing Surveys, 12(4):381{402, 1980.
12] J.Y. Kim and J. Shawe-Taylor. An approximate string-matching algorithm. Theoretical Computer Science, 92:107{17, 1992. 13] D.E. Knuth, J.H. Morris, and V.R. Pratt. Fast pattern matching in strings. Computer Journal, 6(2):323{50, 1977. 14] K. Kukich. Techniques for automatically correcting words in text. Computing Surveys, 24(4):377{439, 1992.
137
15] V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Dokl., 6:126{36, 1966. 16] E.M. McCreight. A space economical su x tree construction algorithm.
Journal of the ACM, 23(2):262{72, 1976.
17] T.H. Merrett. Relational Information Systems. Reston Publishing Co., Reston, VA, 1983. 18] T.H. Merrett and H. Shang. Trie methods for representing text. In
Proceedings of 4th International Conference, FODO'93, LNCS 730, pages
130{45, Chicago, Ill, October 1993. Springer-Verlag. 19] D.R. Morrison. PATRICIA { Practical Algorithm To Retrieve Information Coded In Alphanumeric. Journal of the ACM, 15(4):514{34, 1968. 20] M.K. Odell and R.C. Russell. U.S. Patent Numbers, 1,261,167 (1918) and 1,435,663, 1922. U.S. Patent O ce, Washington, D.C. 21] J.A. Orenstein. Multidimensional tries used for associative searching.
Information Processing Letters, 14(4):150{6, 1982.
138
22] H. Shang. Trie Methods for Text and Spatial Data on Secondary Storage. PhD Dissertation, School of Computer Science, McGill University,
November 1994. 23] G.A. Stephen. String Searching Algorithms. Lecture Notes on Computing. World Scienti c Pub., 1994. 24] E. Ukkonen. Finding approximate patterns in strings. Journal of Algorithms, 6:132{7, 1985.
25] J. Veronis. Computerized correction of phonographic errors. Comput.

Hum., 22:43{56, 1988.
26] R.A. Wagner and M.J. Fischer. The string-to-string correction problem.
Journal of the ACM, 21(1):168{78, 1974.
27] P. Weiner. Linear pattern matching algorithms. In IEEE Symposium

on Switching and Automata Theory, pages 1{11, 1973.
28] S. Wu and U. Manber. Fast text searching. Communications of the

ACM, 35:83{91, 1992.
Heping Shang received a B.S. degree in computer engineering from Changsha Institute of Technology, Changsha, Hunan, China, in 1981, an M.S. 139
degree in computer science from Concordia University, Montreal, Quebec, Canada in 1988, and a Ph.D degree in computer science from McGill University, Montreal, Quebec, Canada in 1995. His research interests include data structures and searching techniques for very large textual and spatial database data, database programming languages, parallel processing and concurrency control. Dr. Shang is now at Replication Server Engineering, Sybase Inc., Emeryville, California, USA. T. H. Merrett received a B.Sc. in mathematics and physics from Queen's University at Kingston, Ontario, Canada (1964) and a D.Phil. in theoretical physics from Oxford University (1968). After two years with IBM (U.K.), he joined the School of Computer Science at McGill University, where he is a professor. His research interests are in database programming languages and data structures and algorithms for secondary storage. Dr. Merrett initated and directs the Aldat Project at McGill University, which has been responsible for data structures for multidimensional data, such as multipaging and Z-order, and for trie-based structures for text and spatial data. The database programming language contributions of the Aldat 140
Project have included the domain algebra; quanti ed tuple (QT)-selectors; relational mechanisms for multidatabases, metadata, and inheritance; methods for process synchronization and nondeterminism; and the computation mechanism, which uni es relations, functions, and aspects of constraint programming.
141
a c d f b d f
0 1 2 3 4 5 6 7
a 1 0 1 2 3 4 5 6 d 2 1 1 1 2 3 4 5 f 3 2 2 2 1 2 3 4 d 4 3 3 2 2 2 2 3 a
0 1 0 1 =)
c
2 1 1 2
d
3 2 1 2
f
4 3 2 1
b
5 4 3 2 2
a 1 d 2 f d
C0 C1 C2 C3 C4 C5 C6 C7
Table 2: Ukkonen's Cuto
142
s a m p l e
0 1 2 3 4 5 6 0 1 2 3 4 5 1 0 1 2 3 4 2 1 1 2 3 4 3 2 2 2 3 3
s a n e
1 2 3 4
column: 1 2 3 4 5 6
s a m e
0 1 2 3 4
e n f
0 1 2 3
s 1 0 1 2 3 s 1 1 2 a 2 1 0 1 2 a 2 2 2 n 3 2 1 1 2 n 3 3 2 e 4 2 2 2 1 e 4 3 3
1 2 3 4 1 2 3 Table 3: Dynamic Programming Tables
143
Text Bible Shakespeare Unix Manual C Program File Names
Text Size 4.5MB 6.4MB 7.6MB 8.4MB 8.4MB
#Sistrings 3.4 M 4.4 M 4.8 M 5.3 M 6.6 M
#Trie Nodes 46.0 M 47.9 M 107.0 M 157.0 M 76.9 M
Table 4: Text File and Index Trie
144
NeXT with 28MB RAM Text Bible Shakespeare elapsed (CPU), sec: agrep 4.45 (4.32) 7.90 (7.76) 7.53 (7.43) 12.63 (12.50) 5.80 (5.68) 7.48 (7.37) 13.52 (13.37) 28.48 (28.29) 22.18 (21.93) 9.10 (8.86) trie 0.68 (0.43) 0.63 (0.41) 1.07 (0.58) 0.68 (0.35) 0.53 (0.38) 2.78 (2.67) 2.78 (2.67) 4.42 (4.32) 4.63 (4.49) 7.17 (7.05)
NeXT with 8MB RAM elapsed (CPU), sec: agrep 5.98 (4.57) 17.50 (9.53) 18.72 (9.51) 24.13 (14.62) 16.82 (7.43) 8.58 (7.48) 23.53 (15.16) 39.58 (30.20) 34.08 (23.95) 21.07 (11.34) trie 0.82 (0.43) 0.90 (0.42) 1.37 (0.58) 0.85 (0.37) 0.75 (0.37) 2.77 (2.55) 8.42 (8.20) 4.15 (3.90) 8.25 (4.68) 13.63 (7.48) 40.12 (24.32) 66.18 (32.93) 80.12 (44.17)
k = 0 Unix Manual
C Program File Names Bible Shakespeare
k = 1 Unix Manual
C Program File Names Bible Shakespeare
13.53 (13.21) 22.52 (22.19) 16.42 (13.51) 24.83 (24.50) 28.57 (28.16) 33.90 (26.40) 46.50 (45.87) 41.63 (40.91) 57.22 (47.58)
k = 2 Unix Manual
C Program File Names
35.83 (35.40) 62.87 (61.40) 47.87 (37.41) 138.75 (67.59) 14.22 (13.77) 98.00 (97.41) 36.40 (16.42) 176.53 (99.20)
Table 5: Approximate Search Time 145

Shang Tries For Approximate String Matching

Uploaded by

Copyright:

Available Formats

Shang Tries For Approximate String Matching

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Shang Tries For Approximate String Matching

Uploaded by

Copyright:

Available Formats

Tries for Approximate String Matching

H. Shang and T.H. Merrett September 8, 1995

worst-case run preproc. time extra space ref.

O(m + j j) O(m + j j) O(n log n) O(j j) O(n)

W . Baeza-Yates 2] reviews exact match algorithms, and we summarize in

in n means O(f (n)) where f is a sublinear

(Regular expression matching and k-approximate string matching solve 104

worst-case run preproc. time extra space ref.

O(mn) O(kn) O(kn) O(m log n)

O(m) O(k) O(m)

O(mn) O(kn) O(m)

26] 24] 7] 12]

w5 = p, and (3) add w7 = e after p7 = l. The edit distance, D(Pi ; Wj ), can

k for the given j (Cj =0 if no such i). Given Cj?1, compute

D(Pi ; Wj ) up to i Cj?1+1, and then set Cj to the largest i (0 i Cj?1 +1)

k. Chang 7] proved that this algorithm evaluates

O(k2) expected entries. As shown in Table 2, for P4 =adfd and W7=acdfbdf

3 Trie and Approximate Search

," will have six sistrings

3.1 Two Observations

and example, the rst three columns of words enface

3.2 Search Algorithm

/* depth- rst trie search */

TrieNode in a leaf node) Level]

each character in the node

/* retrieve characters one by one */ /* nd a target word */

the retrieved character;

output W 1]W 2]...W j-1]; return; if (

/* more than k mistakes */

each child node

/* retrieve child node one by one */

117 the retrieved node; /* nd a target word */

the retrieved character;

output W 1]W 2]...W j-1];

k mismatches. Hence, there is no need to walk down Subtrie(c). A cuto

can have D(P; W )

1. We reject word echo and continue traversing.

exsample String ech enf example sa

k1 Distance 2 2 =1 2 Action reject cutoff accept cutoff

Figure 5: Approximate Trie Search Example

4.1 Search Time

5.1 Best Match

k. A simple method to set k is to let s be an

5.2 Weighted Costs

. For those applications, we can use above de nitions for

5.3 Combining Exact and Approximate Searches

means that target strings must have the su x rantee. What

5.4 Approximate Regular Expression Search

5.4.1 Alternative Operator

(once again, borrowed from de nes a

Unix standard), to describe either 1, 3, or 7. Formally, operator

matches pattern H3A

matches the pattern with one mistake.

, we have p1 =H, p2 =3, ..., and

operator. ' is the case insensitive version for all lower

for anything but vowels,

5.4.2 Kleen Star

and abbbc all match pattern ab*c exactly. a

to example has one di erence, but sinary to scenery has three

/* retrieve characters one by one / / nd a target word */