HMM Based On-Line Handwriting Recognition
achieved a writer-independent recognition rate of 94.5% on 3,823 unconstrained handwritten word samples from 18 writers, covering a 32-word vocabulary.
Index Terms: On-line handwriting recognition, hidden Markov models, subcharacter models, evolutional grammar, invariant features, segmental features.
Fig. 1. Partial diagram of AEGIS.
The most popular are the so-called left-to-right HMMs, in which $a_{ij} = 0$ for $j < i$. The subcharacter and character models we have adopted are even more restrictive, disallowing in addition state skipping ($a_{ij} = 0$ for $j > i + 1$). We have selected this relatively simple topology because it has been shown to be successful in speech recognition, and there has not been sufficient proof that more complex topologies would necessarily lead to better recognition performance. Furthermore, in unconstrained handwriting, skipping of segments seems to happen most often for ligatures, which in our system is handled by treating ligatures as special "connecting" letters and allowing them to be bypassed in the language model.

In the figure, there are six, eight, and five strokes in "a," "g," and "j" (without the dot), respectively. "a" and "g" share the first four strokes; "g" and "j" share the last four strokes s18, s19, s20, s1; stroke s1 (corresponding to the upward ligature) is shared by all three samples. The training procedure will be discussed in more detail in Section 2.4.

Ligatures are attached to letters only during training. At recognition time they are treated as special, one-stroke "connecting" letters inserted between "core" letters and can be skipped with no penalty (see Fig. 3). This treatment ensures that our system can handle mixed style handwriting as opposed to pure cursive only.
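To make the topology concrete, the following minimal sketch (not from the original paper) builds the transition matrix of a left-to-right model with no state skipping; the state count and the self-loop probability are arbitrary placeholders.

```python
import numpy as np

def left_to_right_no_skip(num_states: int, p_stay: float = 0.6) -> np.ndarray:
    """Transition matrix with a_ij = 0 for j < i and for j > i + 1."""
    A = np.zeros((num_states, num_states))
    for i in range(num_states - 1):
        A[i, i] = p_stay            # self-loop
        A[i, i + 1] = 1.0 - p_stay  # advance to the next state only; no skipping
    A[-1, -1] = 1.0                 # final state repeats until the model is exited
    return A

# Example: a three-state stroke model
print(left_to_right_no_skip(3))
```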
structure of the HMM for each letter pattern. The EG initially contains a degenerate grammar consisting of only a start node, an end node and a single arc with a non-terminal label. During the evolution process, upon encountering a non-terminal arc, the arc is first replaced with the subnetwork it represents and is examined again. This process continues until all non-terminal references on the earliest arcs are eliminated. If a resulting label references an HMM, the appropriate model structure is built as indicated by the lexicon. Once all leading HMM references are built, HMM score integration proceeds. As a score emerges from an HMM and needs to be propagated further in the network, additional evolution of the EG may occur. In this way, only those regions of the grammar touched by the HMM search are expanded. Beam search methods [14] can be used to limit the amount of grammar expansion.
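The lazy expansion described above can be sketched as follows. This is an illustrative toy, not the authors' implementation: non-terminals expand into simple label chains rather than into a full network with shared prefixes and suffixes, and the names (`Arc`, `SUBNETS`, `expand_leading`) are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    label: str   # terminal HMM reference (e.g., "a1", "lg1") or non-terminal ("<a>")
    dest: int    # index of the grammar node the arc leads to

# Each non-terminal expands into a chain of labels (a real EG is a network
# with shared prefixes and suffixes; a chain keeps the sketch short).
SUBNETS = {
    "<can>": ["<c>", "<a>", "<n>"],
    "<c>": ["lg1", "c1"],
    "<a>": ["lg1", "a1"],
    "<n>": ["lg1", "n1"],
}

def expand_leading(arc: Arc, nodes: list[list[Arc]]) -> list[Arc]:
    """Replace a non-terminal arc with its subnetwork and return the leading arcs."""
    if arc.label not in SUBNETS:           # already a terminal HMM reference
        return [arc]
    chain = SUBNETS[arc.label]
    dest = arc.dest
    for label in reversed(chain[1:]):      # build the tail of the chain first
        nodes.append([Arc(label, dest)])   # one new grammar node per remaining label
        dest = len(nodes) - 1
    # the head of the chain may itself be non-terminal, so keep expanding
    return expand_leading(Arc(chain[0], dest), nodes)

# Degenerate initial grammar: node 0 = start, node 1 = end, one non-terminal arc.
nodes = [[Arc("<can>", 1)], []]
nodes[0] = expand_leading(nodes[0][0], nodes)
print(nodes)   # the leading arc is now the terminal ligature model "lg1";
               # the "<a>" and "<n>" arcs stay unexpanded until the search reaches them
```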
Fig. 3 shows part of the evolutional grammar for a dictionary containing the word "can" and how it evolves during decoding. Labels in brackets are non-terminal labels and the others are terminal labels. Labels composed of a single letter followed by different digit subscripts refer to the separate HMMs for different patterns of the same letter; "lg1" refers to the upward ligature model. Unlabeled arcs are the null-arcs. For any given dictionary, a grammar compiler [15] is used to convert a list of word specifications into an optimized network with shared prefixes and suffixes.

Fig. 3. Partial diagrams of an EG network.

2.3 Decoding

The Viterbi algorithm is used to search for the most likely state sequence corresponding to the given observation sequence and to give the accumulated likelihood score along this best path [11]. Suppose that for any state $i$, $q_i(t)$ denotes the selected state sequence (hypothesis) leading to $i$ at sample point $t$, and $\delta_i(t)$ denotes the accumulated log-likelihood score of that hypothesis. $O_t$ represents the observation at sample point $t$, and $\lambda_i(O_t)$ represents the log-likelihood score of $O_t$ in state $i$. In our current model, for efficiency reasons, we assume that all the state-preserving probabilities $a_{ii}$ are constant and therefore need not be included in the accumulated likelihood scores. (We have experimented with variable state-preserving probabilities and they showed no significant improvement in recognition performance over the constant ones.) Since each letter model is a left-to-right HMM with no state skipping, within each letter model the hypothesis and its likelihood are updated as:

$$\delta_i(t) = \max\{\delta_{i-1}(t-1),\ \delta_i(t-1)\} + \lambda_i(O_t)$$

We now explain how scores are propagated through grammar nodes. Suppose $g$ is a grammar node and $p(g)$ and $s(g)$ denote the sets of preceding and succeeding letter pattern classes corresponding to the incoming and outgoing arcs, respectively. For each letter pattern class $l$, $m(l)$ denotes the HMM used to model the pattern; $h(l)$ denotes the initial state of the model; and $f(l)$ denotes the final state of the model. At each sample point $t$ during the Viterbi search, the maximum of all the accumulated scores at the final states of the preceding letter models, also called incoming scores, is found and propagated to the initial state of each of the succeeding models, along with the corresponding state sequence. The operation is carried out as follows:

$$k = \arg\max_{l \in p(g)} \delta_{f(l)}(t-1) \qquad (3)$$

and for each state $j = h(l)$, $l \in s(g)$:

$$\delta_j(t) = \max\{\delta_{f(k)}(t-1),\ \delta_j(t-1)\} + \lambda_j(O_t) \qquad (4)$$

$$q_j(t) = \begin{cases} (q_{f(k)}(t-1),\ j) & \text{if } \delta_{f(k)}(t-1) \ge \delta_j(t-1) \\ (q_j(t-1),\ j) & \text{otherwise} \end{cases} \qquad (5)$$
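A minimal sketch of the within-letter update and the grammar-node propagation of (3)-(5), assuming log-domain scores; the emission values and model names are placeholders (not from the paper), and hypothesis backtracking is omitted.

```python
import numpy as np

def letter_update(delta_prev: np.ndarray, log_emission: np.ndarray) -> np.ndarray:
    """Within-letter Viterbi update for a left-to-right, no-skip letter model.

    delta_prev[i] holds the accumulated log score of state i at the previous
    sample point; log_emission[i] is lambda_i(O_t) for the current observation.
    """
    delta = np.empty_like(delta_prev)
    delta[0] = delta_prev[0] + log_emission[0]   # initial state: seeded by a grammar node
    for i in range(1, len(delta_prev)):
        delta[i] = max(delta_prev[i - 1], delta_prev[i]) + log_emission[i]
    return delta

def grammar_node_incoming(final_scores: dict[str, float]) -> tuple[str, float]:
    """Like equation (3): pick the best accumulated score among the final states
    of the preceding letter models; the winner is then propagated to the initial
    state of every succeeding letter model."""
    k = max(final_scores, key=final_scores.get)
    return k, final_scores[k]

# Example with made-up numbers: two preceding letter models feed one grammar node,
# and the winning score seeds the initial state of a three-state succeeding model.
k, best = grammar_node_incoming({"a1": -12.3, "a2": -15.7})
seed = np.array([best, -np.inf, -np.inf])
delta_t = letter_update(seed, np.log(np.full(3, 0.2)))
```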
2.4 Model Training

Models are trained using the well-known iterative segmental training method [11]. Given a set of training samples, the HMM for each sample is instantiated by concatenating HMMs for the appropriate letters, ligatures and delayed strokes, which are in turn instantiated by concatenating the composing stroke models. The training procedure is then carried out through iterations of segmentation of the training samples by the Viterbi algorithm using the current model parameters, followed by parameter reestimation using the means along the path. The iterative procedure stops when the difference between the likelihood scores of the current iteration and those of the previous one is smaller than a preset threshold.

We have developed a training process composed of three consecutive stages. The initial parameters for each stage are the output from the previous stage, while the initial parameters for the first stage are obtained through equal-length segmentation of the training samples. No manual segmentation is involved at any stage.

The first stage, letter training, is carried out on isolated letter samples including ligatures and delayed strokes. This stage essentially serves as a model initializer; it is left to the later training stages to fully capture the characteristics of cursive handwriting and variations among different writers. The model parameters obtained are then passed on as initial parameters for the second stage of training, linear word training, which is carried out on whole word samples. We call it linear because during this stage each word sample is bound to a single sequence of stroke models. In other words, each sample is labeled not only by the corresponding word, but also by the exact letter pattern sequence corresponding to the particular style of that sample, which is then converted to a unique stroke sequence according to the lexicon. Such highly constrained training is necessary to obtain reliable results when the models are not yet adequately trained. The disadvantage is that the letter pattern sequence corresponding to each sample has to be manually composed, which is a demanding and error-prone process, especially when the training set is large.
This is the reason why we introduce the third training stage, lattice word training. As the name suggests, during this stage each word is represented by a lattice, or finite state network, that includes all possible stroke sequences that can be used to model the word.
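The overall control flow of the iterative segmental training described above can be sketched as follows; `viterbi_segment` and `reestimate` stand in for the actual alignment and parameter-update steps, which are supplied by the caller and not shown here.

```python
def segmental_training(samples, params, viterbi_segment, reestimate,
                       threshold=1e-3, max_iters=50):
    """Alternate Viterbi segmentation and parameter reestimation until the
    total likelihood improves by less than a preset threshold.

    `viterbi_segment(sample, params)` should return an alignment object with a
    `log_likelihood` attribute; `reestimate(samples, alignments)` should return
    updated parameters.
    """
    prev_score = float("-inf")
    for _ in range(max_iters):
        # 1. Segment every training sample with the current model parameters.
        alignments = [viterbi_segment(s, params) for s in samples]
        # 2. Reestimate parameters from the observations assigned to each state
        #    (e.g., the means along the Viterbi path).
        params = reestimate(samples, alignments)
        # 3. Stop when the likelihood gain falls below the threshold.
        score = sum(a.log_likelihood for a in alignments)
        if score - prev_score < threshold:
            break
        prev_score = score
    return params
```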
3 INVARIANT FEATURES

In choosing handwriting features, we face the problem of variability in handwriting caused by the geometric distortion of letters and words by rotation, scaling and translation. In general, translation is not a serious problem because it is easy to choose features that are invariant with respect to translation. Examples include handwriting stroke tangents and curvature. Unfortunately, these features are not invariant with respect to the other two factors. For example, stroke tangents are invariant with respect to scale, but not rotation; curvature is invariant with respect to rotation, but not scale.

There are two principal methods for dealing with variability in pattern recognition. The patterns can be normalized before feature extraction by some set of preprocessing transformations, or features can be chosen to be insensitive to the undesirable variability. These two methods often need to be combined because neither one can solve the problem completely by itself. On one hand, excessive preprocessing is undesirable because it may result in premature, limiting decisions or loss of information. On the other hand, certain features that are not completely invariant (e.g., tangent slope angle) prove to be important in distinguishing different symbols. We have adapted one of the common features for handwriting recognition, tangent slope angle, which is invariant to translation and scaling, but not rotation. In this section, we introduce two new features for handwriting recognition that are invariant with respect to all three factors of geometric distortion.

We define a similitude transformation to be a combination of translation, rotation and scaling described by:

$$P' = cUP + v$$

where $c$ is a positive scalar. Two curve segments are equivalent if they can be obtained from each other through a similitude transformation. Invariant features are features that have the same value at corresponding points on different equivalent curve segments.

Suppose that a smooth planar curve $P(t) = (x(t), y(t))$ is mapped into $\tilde{P}(\tilde{t}) = (\tilde{x}(\tilde{t}), \tilde{y}(\tilde{t}))$ by a reparameterization $t(\tilde{t})$ and a similitude transformation, i.e., $\tilde{P}(\tilde{t}) = cU P(t(\tilde{t})) + v$. Without loss of generality, assume that both curves are parameterized by arc length (natural parameter), i.e., $t = s$ and $\tilde{t} = \tilde{s}$. Obviously, $d\tilde{s} = c\,ds$, thus the corresponding points on the two curves are related by $\tilde{P}(\tilde{s}) = cU P((\tilde{s} - \tilde{s}_0)/c) + v$. It can then be shown [16] that curvature (the reciprocal of radius) at the corresponding points of the two curves is scaled by $1/c$, i.e., $\tilde{\kappa}(\tilde{s}) = \frac{1}{c}\kappa((\tilde{s} - \tilde{s}_0)/c)$. It follows that

If we fix the value of $\theta$ to a constant $\theta_0$, then (7) defines another invariant feature which can be computed at each sample point. We call this feature ratio of tangents. In order to enhance the distinctive power of the feature, we augment it by the sign of the local curvature. The resulting feature is called signed ratio of tangents, and is used instead of ratio of tangents in the experiments described later.

Fig. 5. Ratio of tangents.

To evaluate accurately the invariant features described above, high quality derivative estimates of up to the third order have to be obtained from the sample points. Obviously, simple finite difference based methods for derivative estimation do not provide the needed insensitivity to spatial quantization error or noise. We have applied a set of smoothing spline operators of up to the fifth order for this purpose [10].

With the addition of the two invariant features the observation vector now contains three dimensions. Even though these vectors are continuous in nature, we chose to use discrete HMMs instead of continuous density HMMs to avoid making assumptions on the form of the underlying distribution. To simplify our models, we also chose to treat the features as being independent from each other. A separate distribution is estimated for each feature in each state during training. The joint probability of observing symbol vector $S_t = \{k_1, k_2, k_3\}$ in state $j$ is:

$$b_j(k_1, k_2, k_3) = \prod_{i=1}^{3} b_j^i(k_i)$$

where $b_j^i(k_i)$ is the probability of observing symbol $k_i$ in state $j$ according to the probability distribution of the $i$th feature.
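A small sketch (not from the paper) of this per-feature discrete emission model follows; the codebook sizes are arbitrary placeholders, and the optional weights anticipate the feature weighting discussed next (the paper's exact weighting and normalization are not reproduced).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_table(num_symbols: int) -> np.ndarray:
    """A valid discrete probability distribution over one feature's codebook."""
    p = rng.random(num_symbols)
    return p / p.sum()

# One table per feature for a single state j (codebook sizes are illustrative).
state_tables = [random_table(16), random_table(16), random_table(16)]

def log_emission(symbols: tuple[int, int, int],
                 tables: list[np.ndarray],
                 weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted log-probability of observing (k1, k2, k3) in one state.

    With unit weights this is just the log of the product of the independent
    per-feature probabilities.
    """
    return sum(w * np.log(t[k]) for w, t, k in zip(weights, tables, symbols))

print(log_emission((3, 7, 11), state_tables))
```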
In order to adjust the influence of different features according to their discriminative power, we compute the weighted log-likelihood, derived from the weighted probabilities; these are constructed so that they are not biased towards any particular state. The weights $w_i$ are positive scalars and specify the relative dominance of each feature.

4 SEGMENTAL FEATURES

The HMM system described so far relies on localized features. During the decoding process, for each new sample data point taken in a time ordered sequence, the HMM hypotheses scores are discretely integrated and propagated through the HMM network. The incremental score at each step depends only on the local feature at the current point. We will refer to methods of this type as point oriented. The advantage of point oriented methods is that all knowledge sources are integrated into a single model. Because of this, all possible segmentations and identifications of the input pattern are considered in an efficient manner. On the other hand, point oriented methods have the disadvantage of using only localized observation measurements. Thus, shape information on larger scales is missing from the process. One way to remedy this problem is to extract features from a window around each sample point, where the window could be of fixed (e.g., [17]) or variable size (e.g., features in [6], the ratio of tangents feature described in the previous section, etc.). The weakness of this method is that it does not adapt to the varying characteristics of pattern shapes, sizes, or segmentation boundaries.

There are methods that trade the efficiency advantage of point oriented methods for the greater accuracy of measurements over larger regions. These methods fall into the class called segment oriented [4], where a script is first presegmented into letters or subcharacter primitives according to certain predefined boundary conditions (pen-ups, cusps, etc.) and one observation feature vector is computed for each segment. To avoid losing potentially good hypotheses, segment oriented methods should first generate all possible segmentations, because the scoring knowledge is not available at the time of segmentation and will be applied later as a post-process; but in practice no system actually does this because the number of possible segmentations makes the problem intractable.

We propose a new segment oriented method that ameliorates the usual tradeoff between efficiency and accuracy. We call this the interleaved segmental method. Point oriented methods are used to obtain partial segmentation hypotheses which are augmented with observation measurements made on the hypothesized segments. The resulting system is called an augmented HMM system.

In Section 2.3, we explained how hypotheses are propagated through a grammar node $g$ in a commonly used implementation ((3), (4), (5)). In order to incorporate segmental shape information (in this case letter shape information) into the search, we augment the incoming scores with letter matching scores, computed using global letter shape models. To be more specific, let $\alpha_l(t_1, t_2)$ be the likelihood score of the segment from sample point $t_1$ to $t_2$ on the observation sequence being matched to letter pattern class $l$; the augmented incoming scores are defined as

$$\delta'_{f(l)}(t-1) = \delta_{f(l)}(t-1) + \alpha_l(t_l, t-1),$$

where $t_l = t - d_l(t-1)$ and $d_l(t-1)$ is the number of sample points assigned to letter model $m(l)$ (letter duration) up to sample point $t-1$. Using these augmented scores, (3), (4), (5) are replaced by the corresponding equations with the incoming scores $\delta_{f(l)}(t-1)$ replaced by $\delta'_{f(l)}(t-1)$.

By augmenting the incoming scores at each grammar node with the letter matching scores, the overall shapes of letters are taken into consideration when the hypotheses leading to the grammar node are ranked and the one with the highest rank is chosen and propagated to the succeeding letter models. Through this mechanism, letter matching scores computed on dynamically allocated segments directly affect the decision making at each point during the Viterbi search, so that the system is biased towards sequences with better matches at the letter level.

It should be pointed out that in such an augmented HMM system, the state sequence resulting from the Viterbi search is no longer guaranteed to be the optimal sequence, because now the accumulated score of a sequence leading to state $i$ at sample point $t$ not only depends on the previous state, but also on how the previous state was reached (history reflected in the letter duration). This dependence violates the basic condition for the Viterbi algorithm to yield an optimum solution. However, as shall be shown later, our experimental results suggest that the gain obtained by incorporating segmental features by far outweighs the loss in the optimality of the algorithm.

The segmental matching score $\alpha_l(t_1, t_2)$ is computed using a correlation based metric inspired by the metrics developed by Sinden and Wilfong [18]. Given a segment $a$ with coordinate sequence $\langle (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \rangle$, the instance vector of $a$ is defined as $v_a = (\bar{x}_1, \bar{y}_1, \bar{x}_2, \bar{y}_2, \ldots, \bar{x}_n, \bar{y}_n)$, where $\bar{x}_i = x_i - x_a$ and $\bar{y}_i = y_i - y_a$ for $1 \le i \le n$, and $(x_a, y_a)$ is the centroid of $a$. The normalized instance vector of $a$, $u_a = v_a / \|v_a\|$, is a translation and scale independent representation of segment $a$ in $R^{2n}$. Through a resampling procedure, a sample segment of arbitrary length can be mapped to a vector in $R^{2N}$, where $N$ is a predetermined number. The difference between any two sample segments $a$ and $b$ is defined as $D(a, b) = 0.5(1 - u_a \cdot u_b)$, whose value ranges from 0 (when $a$ and $b$ are identical) to 1. The segmental matching score $\alpha_l(t_1, t_2)$ is then defined as $\alpha_l(t_1, t_2) = -w_s D(a_l, a_{t_1, t_2})$, where $a_l$ is the model segment for letter pattern $l$, $a_{t_1, t_2}$ is the segment from sample point $t_1$ to $t_2$ on the input sample sequence, and $w_s$ is a weight factor.

In order to compute the above segment matching score, a single model segment needs to be derived for each letter pattern class. Let $u_1, u_2, \ldots, u_M$ be the normalized instance vectors of a set of prototypes for the letter pattern class $l$ (which can be easily obtained as side products of segmental training). A single model segment representing this class is represented by the vector $w$ which minimizes the sum of distances from the individual prototypes. It can be easily shown that $w = \bar{u}/\|\bar{u}\|$, where $\bar{u} = \sum_{i=1}^{M} u_i$.
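A sketch of the segmental matching computation just described, assuming a simple linear-interpolation resampling step; the resampling length and the weight `w_s` are placeholders, not values from the paper.

```python
import numpy as np

def instance_vector(points: np.ndarray, num_points: int = 20) -> np.ndarray:
    """Map an (n, 2) coordinate array to a normalized vector in R^(2*num_points)."""
    # resample to a fixed number of points by linear interpolation over the index
    idx = np.linspace(0, len(points) - 1, num_points)
    xs = np.interp(idx, np.arange(len(points)), points[:, 0])
    ys = np.interp(idx, np.arange(len(points)), points[:, 1])
    v = np.column_stack([xs, ys])
    v -= v.mean(axis=0)                 # translate so the centroid is the origin
    v = v.ravel()
    return v / np.linalg.norm(v)        # unit length gives scale independence

def segment_difference(a: np.ndarray, b: np.ndarray) -> float:
    """D(a, b) = 0.5 * (1 - u_a . u_b), in [0, 1]; 0 for identical shapes."""
    return 0.5 * (1.0 - float(np.dot(instance_vector(a), instance_vector(b))))

def model_segment(prototypes: list[np.ndarray]) -> np.ndarray:
    """Model segment: normalized sum of the prototype instance vectors."""
    u_bar = np.sum([instance_vector(p) for p in prototypes], axis=0)
    return u_bar / np.linalg.norm(u_bar)

def matching_score(model_u: np.ndarray, segment: np.ndarray, w_s: float = 1.0) -> float:
    """alpha = -w_s * D(model, segment), using a precomputed model instance vector."""
    return -w_s * 0.5 * (1.0 - float(np.dot(model_u, instance_vector(segment))))
```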
Fig. 6 illustrates the effect of incorporating letter matching scores in the HMM system by comparing the different letter level segmentations obtained without and with letter matching scores. Fig. 6a shows the segmentation of a sample of the word "line" when the basic HMM system was used and the sample was falsely recognized as "arc." Fig. 6b shows the segmentation of the same sample when letter matching scores are applied and the sample is correctly recognized. The hypothesis shown in Fig. 6a is not selected now because the segment corresponding to letter "a" does not match the corresponding model segment well and therefore yields a poor letter matching score.

Fig. 6. Letter level segmentations obtained (a) without letter matching scores (1 - ligature, 2 - letter "a", 3 - ligature, ...) and (b) with letter matching scores (1 - ligature, 2 - letter "l", 3 - ligature, 4 - letter "i", 5 - ligature, 6 - letter "n", 7 - ligature, 8 - letter "e", 9 - ligature, 10 - delayed stroke "dot").

5 EXPERIMENTAL RESULTS

To test our system, we composed a vocabulary of 32 words, targeting an underlying application of a pen driven graphics editor. The vocabulary covers all 26 lower case letters in the English alphabet, contains many groups of easily confused words, such as "line," "lines" and "spline," or "cut" and "out," and has both very short words such as "in" and relatively long ones such as "rectangle." Samples from 18 writers have been collected, the writer group containing men and women, left-handed and right-handed, of many different cultural origins: American, European, Asian, and South American. During sample collection, the writers were told to write in their most natural way with no explicit constraints. Each word was written 15 times by each writer. After removing invalid samples (samples with misspelling or parts missing due to hardware problems), the final data set consists of 8,595 word samples.

The isolated letter samples used for letter training were all written by one writer, imitating all writing styles known to the authors. 10-15 samples were written for each unique style of each letter, and a particular stroke sequence is bound to those samples. There are altogether 54 lower-case letter models (as sequences of stroke models), including delayed strokes and ligatures, composed of a total of 93 stroke models. Each stroke is currently modeled by a single state.

Ten writers were chosen (after data collection) to be the "training writers." The whole word training set is composed of about 10 samples of each word from each training writer, a total of 3,180 samples. About 600 of the 3,180 training samples are used for linear word training, and the whole training set is used for lattice word training. The test set is composed of all samples not used for training, divided into two groups. Group A is the multiple-writer test group, which contains 1,592 samples from the 10 training writers; group B is the writer-independent test group, composed of 3,823 samples from the eight other writers.

Table 1 summarizes the performance of our recognition system. Error rates from three experiments as well as the settings for each experiment are listed. The top line of Fig. 7 shows some of the samples from test set B that were not recognized correctly. In fact they are so sloppy that even human beings can hardly recognize them correctly. The bottom line of the same figure shows some of the more legible samples from the same writer that were recognized correctly.

TABLE 1

6 CONCLUDING REMARKS

We have described experiments in handwriting recognition using hidden Markov modeling and stochastic language modeling methods originally developed for speech applications. These methods are generalized in the AEGIS architecture. Subcharacter models called nebulous stroke models are used to model the basic units in handwriting. We introduce two new features for handwriting that are invariant under translation, rotation and scaling. Invariant features have been discussed extensively in the computer vision literature. However, they have rarely been used in real applications due to the difficulty involved in the estimation of high order derivatives. We have demonstrated that these high order invariant features can indeed be made useful with careful implementation.

A method for combining segment oriented features in a stochastic pattern recognizer has been developed. In this method, called the interleaved segmental method, partial segmentation hypotheses obtained using the point oriented features in a conventional dynamic programming search are combined with scores based on segmental shape measurements made on the hypothesized segments. Although certain optimality characteristics of the HMM system are sacrificed in the process, significant reduction in recognition error was achieved by this method, reducing the writer-independent error rate by nearly 50%.

Finally, we would like to point out that although we report only recognition results on a relatively small vocabulary of 32 words in this paper, none of the techniques presented is inherently dependent on the size of the vocabulary. The system can be easily adjusted to handle a large or unlimited vocabulary by imposing different grammatical constraints. For example, one could use a