Phoneme Recognition: Neural Networks vs. Hidden Markov Models
A. Waibel T. Hanazawa G. Hinton K. Shikano

ATR Interpreting Telephony Research Laboratory

University of Toronto and Canadian Institute for Advanced Research
Carnegie-Mellon University

Abstract

[...] phoneme recognition which is characterized by two important properties: 1.) Using a 3 layer arrangement of simple computing units, it can represent arbitrary nonlinear decision surfaces. The TDNN learns these decision surfaces automatically using error back-propagation[1]. 2.) The time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independent of position in time and hence not blurred by temporal shifts in the input. For comparison, several discrete Hidden Markov Models (HMM) were trained to perform the same task, i.e., the speaker-dependent recognition of the phonemes "B", "D", and "G" extracted [...] We show that the TDNN "invented" well-known acoustic-phonetic [...] to the same concept.

1 Introduction

In recent years, the advent of new learning procedures and the availability of high speed parallel supercomputers have given rise to a renewed interest in connectionist models of intelligence[1]. These models are particularly interesting for cognitive tasks that require massive constraint satisfaction, i.e., the parallel evaluation of many clues and [...] their interpretation in the light of numerous interrelated constraints. Because of the far-reaching implications to speech recognition, neural networks have recently been compared with other pattern recognition classifiers[2] and explored as alternative to other speech recognition techniques (see [2,3] for review). Some of these studies report very encouraging performance results[4], but others show neural nets as underperforming existing techniques. One possible explanation for the mixed comparative performance results so far might be given by the inability of many neural network architectures to deal properly with the dynamic nature of speech. Various solutions to this problem, however, are now beginning to emerge[5,6,7,8] and continued work in this area is likely to lead to more powerful speech recognition systems in the future.

To capture the dynamic nature of speech a network must be able to 1.) represent temporal relationships between acoustic events, while at the same time 2.) provide for invariance under translation in time. The specific movement of a formant in time, for example, is an important cue to determining the identity of a voiced stop, but it is irrelevant whether the same set of events occurs a little sooner or later in the course of time. Without translation invariance a neural net requires precise segmentation, to align the input pattern properly. Since this is not always possible in practice, learned features tend to get blurred (in order to accommodate slight misalignments) and their performance deteriorates. In the present paper, we describe a Time Delay Neural Network (TDNN), which addresses both of these aspects. We demonstrate through extensive performance evaluation that superior recognition results can be achieved.

2 Time Delay Neural Networks

To be useful for speech recognition, a feed forward neural network must have a number of properties. First, it should have multiple layers and sufficient [...] units in each of these layers. This is to ensure that it will have the ability to learn complex nonlinear decision surfaces. Second, the network should have the ability to represent relationships between events in time. These events [...], but might also be the output of higher level feature detectors. Third, the actual features or abstractions learned by the network should be invariant under translation in time. Fourth, the learning procedure should not require precise [...] amount of training data so that the network is forced to encode the training data by extracting regularity. In the following, we describe a [...] of phonemes, in particular, the voiced stops "B", "D" and "G".

Figure 1. A Time Delay Neural Network (TDNN) unit

The basic unit used in many neural networks computes the weighted sum of its inputs and then passes this sum through a nonlinear function, most commonly a threshold or sigmoid function[2,1]. In our TDNN, this basic unit is modified by introducing delays D1 through DN [...]. The J inputs of such a unit now will be multiplied by several weights, one for each delay and one for the undelayed input. For N = 2 and J = 16, for example, 48 weights will be needed to compute the weighted sum of the 16 inputs, with each input now

measured at three different points in time. In this way a TDNN unit has the ability to relate and compare current input with the past history of events. The sigmoid function was chosen as the non-linear output function F due to its convenient mathematical properties[1,9].

For the recognition of phonemes, a three layer net is constructed. Its overall architecture and a typical set of activities in the units are shown in Fig.2.

Figure 2: The TDNN architecture (input: "DA")
[Figure labels, bottom to top: Input Layer (15 frames, 10 msec frame rate), Hidden Layer 1, Hidden Layer 2, Output Layer (integration)]

At the lowest level, 16 melscale spectral coefficients serve as input to the network. Input speech, sampled at 12 kHz, was Hamming windowed and a 256-point FFT computed every 5 msec. Melscale coefficients were computed from the power spectrum[3] and adjacent coefficients in time collapsed resulting in an overall 10 msec frame rate. The coefficients of an input token (in this case 15 frames of speech centered around the hand labeled vowel onset) were then normalized to lie between -1.0 and +1.0 with the average at 0.0. Fig.2 shows the resulting coefficients for the speech token "DA" as input to the network, where positive values are shown as black and negative values as grey squares.

This input layer is then fully interconnected to a layer of 8 time delay hidden units, where J = 16 and N = 2 (i.e., 16 coefficients over three frames with time delay 0, 1 and 2). An alternative way of seeing this is depicted in Fig.2. It shows the inputs to these time delay units expanded out spatially into a 3 frame window, which is passed over the input spectrogram. Each unit in the first hidden layer now receives input (via 48 weighted connections) from the coefficients in the 3 frame window. The particular delay choices were motivated by earlier studies[3].

In the second hidden layer, each of 3 TDNN units looks at a 5 frame window of activity levels in hidden layer 1 (i.e., J = 8, N = 4). The choice of a larger 5 frame window in this layer was motivated by the intuition that higher level units should learn to make decisions over a wider range in time based on more local abstractions at lower levels.

Finally, the output is obtained by integrating (summing) the evidence from each of the 3 units in hidden layer 2 over time and connecting it to its pertinent output unit (shown in Fig.2 over 9 frames for the "D" output unit). In practice, this summation is implemented simply as another TDNN unit which has fixed equal weights to a row of unit firings over time in hidden layer 2.

When the TDNN has learned its internal representation, it performs recognition by passing input speech over the TDNN units. In terms of the illustration of Fig.2 this is equivalent to passing the time delay windows over the lower level units' firing patterns. At the lowest level, these firing patterns simply consist of the sensory input, i.e., the spectral coefficients.

Each TDNN unit outlined in this section has the ability to encode temporal relationships within the range of the N delays. Higher layers can attend to larger time spans, so local short duration features will be formed at the lower layer and more complex longer duration features at the higher layer. The learning procedure ensures that each of the units in each layer has its weights adjusted in a way that improves the network's overall performance.

2.2 Learning in a TDNN

Several learning techniques exist for optimization of neural networks[1,2]. For the present network we adopt the Back-propagation Learning Procedure[1,9]. This procedure iteratively adjusts all the weights in the network so as to decrease the error obtained at its output units. To arrive at a translation invariant network, we need to ensure during learning that the network is exposed to sequences of patterns and that it is allowed (or encouraged) to learn about the most powerful cues and sequences of cues among them. Conceptually, the back-propagation procedure is applied to speech patterns that are stepped through in time. An equivalent way of achieving this result is to use a spatially expanded input pattern, i.e., a spectrogram plus some constraints on the weights. Each collection of TDNN-units described above is duplicated for each one frame shift in time. In this way the whole history of activities is available at once. Since the shifted copies of the TDNN-units are mere duplicates and are to look for the same acoustic event, the weights of the corresponding connections in the time shifted copies must be constrained to be the same. To realize this, we first apply the regular back-propagation forward and backward pass to all time shifted copies as if they were separate events. This yields different error derivatives for corresponding (time shifted) connections. Rather than changing the weights on time-shifted connections separately, however, we actually update each weight on corresponding connections by the same value, namely by the average of all corresponding time-delayed weight changes¹. Fig.2 illustrates this by showing in each layer only two connections that are linked to (constrained to have the same value as) their time shifted neighbors. Of course, this applies to all connections and all time shifts. In this way, the network is forced to discover useful acoustic-phonetic features in the input, regardless of when in time they actually occurred. This is an important property, as it makes the network independent of error-prone preprocessing algorithms, that otherwise would be needed for time alignment and/or segmentation.

¹ Note that in the experiments reported below these weight changes were actually carried out after presentation of all training samples[9].

The procedure described here is computationally rather expensive, due to the many iterations necessary for learning a complex multi-dimensional weight space and the number of learning samples. In our case, about 800 learning samples were used and between 20,000 and 50,000 iterations (step-size 0.002, momentum 0.1) of the back-propagation loop were run over all training samples. For greater learning speed, simulations were run on a 4 processor Alliant supercomputer and a staged learning strategy[3] was used. Learning still took about 4 days, but additional substantial increases in learning speed are possible[3]. Of course, this high computational cost applies only to learning. Recognition can easily be run in better than [...]

3 Hidden Markov Models

As an alternative recognition approach we have implemented several Hidden Markov Models (HMM) aimed at phoneme recognition. HMMs are currently the most successful and promising approach [10,11,12] in speech recognition as they have been successfully applied to the whole spectrum of recognition tasks. HMMs' success is

due to their ability to cope with the variability in speech by [...] modeling. The HMMs developed in our laboratory [...] phoneme recognition, more specifically the voiced stops "B", "D" and "G". More detail including results from experiments [...] these models are given elsewhere[13,3] and we will restrict ourselves to a brief description of our best configuration.

The acoustic front end for Hidden Markov Modeling is typically a vector quantizer that classifies sequences of short-time spectra. Input speech was sampled at 12 kHz, preemphasized by (1 - 0.97 z^{-1}) and windowed using a 256-point Hamming window every 3 msec. Then a 12-order LPC analysis was carried out. A codebook of 256 LPC [...] from 216 phonetically balanced [...] The Weighted Likelihood Ratio augmented with power values [...] was used as LPC distance measure for vector quantization.

[...] HMM with four states and six transitions (the last state [...] selfloop) was used in this study. The HMM probability values were trained using vector sequences of phonemes according to the forward-backward algorithm[10]. The vector sequences for "B", "D" and "G" include a consonant part and five frames of the following vowel. This is to model important transient information, such as formant movement, and has led to improvements over context independent models[13]. The HMM was trained until convergence using [...]50 phoneme tokens of vector sequences per speaker and phoneme. Typically, about 10 to 20 learning iterations were required [...] about one hour on a VAX 8700. Floor values were set on the output probabilities to avoid errors caused by zero-probabilities. We have experimented with composite models, which were trained using a combination of context-independent and context-dependent probability values[12], but in our case no significant [...]

4 Recognition Experiments

We now turn to an experimental evaluation of the two techniques described in the previous sections. To provide a good framework for comparison, the same experimental conditions were given to both methods. For both, the same training data was used and both were tested on the same testing database as described below.

4.1 Experimental Conditions

[...] evaluation, we have used a large vocabulary database [...] Japanese words[3]. These words were uttered in isolation by three male native Japanese speakers (MAU, MHT and MNM, [...] announcers). All utterances were recorded in a sound [...] digitized at a 12 kHz sampling rate. The database was [...] into a training set and a testing set of 2620 utterances [...] phonetic tokens were extracted.

The phoneme recognition task chosen for this experiment was the recognition of the voiced stops, i.e., the phonemes "B", "D" and "G". The actual tokens were extracted from the utterances using manually selected acoustic-phonetic labels provided with the database[3]. Both [...] and the HMMs, were trained and [...] database, no preselection of tokens was performed. All tokens [...] of the three voiced stops were included. Since [...] were extracted from entire utterances and not [...] significant amount of acoustic variability is introduced [...] context and the token's position within the [...]². [...] our recognition algorithms are only given the [...] a token and must find their own ways of representing [...] of speech. Since recognition results based [...] are not meaningful, we report in the following [...], i.e., from performance evaluation [...] over the separate testing data set.

4.2 Results

Table 1 shows the results from the recognition experiments described above. As can be seen, for all three speakers, the TDNN yields considerable performance improvements over our HMM. Averaged over all three speakers, the error rate is reduced from 6.3% to 1.5%, a more than fourfold reduction in error.

Figure 3: Scatter plots showing log probabilities/activation levels for [...] using an HMM (left) and a TDNN [...]

[...] of a TDNN have a tendency to [...] seen from the cluster of dots in the [...] scatter plots. Most output units tend to [...] to improve recognition [...] if one were to eliminate among speaker MAU's tokens [...] highest activation level is less than 0.5 and those [...] more closely competing [...] the scatter plots; see[3] [...] would be rejected, while the remaining substitution [...]

4.3 The Learned Internal Representations of a [...]

[...] found in [3]. Fig.2 and the [...] of Fig.4 show two typical instances of a "D" out of two different phonetic contexts ("DA" and "DO", respectively [...] fires strongly, despite [...] considerably from each other [...] the internal firings in these

² In Japanese, for example, a "G" is nasalized when it occurs embedded in an utterance, but not in utterance initial position[3].
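To make the "internal firings" discussed here concrete, the following is a minimal pure-Python sketch of a forward pass through a network with the layer sizes given in Section 2 (16 coefficients x 15 frames, 8 hidden units with a 3 frame window, 3 hidden units with a 5 frame window, 3 integrated outputs). It is an illustration under stated assumptions, not the authors' implementation: the function names, the bias terms, the random weight initialization and the use of a plain average for the equal-weight integration over time are all assumptions made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tdnn_layer(frames, weights, biases, window):
    """Slide a time-delay window of `window` frames over `frames`.

    frames  : list of T frames, each a list of J activations
    weights : weights[u][d][j] for unit u, delay d, input j; the same
              weights are reused at every time shift (weight sharing)
    biases  : one bias per unit (an assumption of this sketch)
    Returns T - window + 1 output frames, one activation per unit.
    """
    out = []
    for t in range(len(frames) - window + 1):
        frame_out = []
        for w_u, b_u in zip(weights, biases):
            total = b_u
            for d in range(window):
                total += sum(w * x for w, x in zip(w_u[d], frames[t + d]))
            frame_out.append(sigmoid(total))
        out.append(frame_out)
    return out


def random_weights(rng, units, window, inputs, scale=0.1):
    """Hypothetical initializer: small uniform random weights."""
    return [[[rng.uniform(-scale, scale) for _ in range(inputs)]
             for _ in range(window)] for _ in range(units)]


def tdnn_forward(spectrogram, params):
    # Hidden layer 1: 8 units, 3 frame window (J = 16, N = 2).
    h1 = tdnn_layer(spectrogram, params["w1"], params["b1"], window=3)
    # Hidden layer 2: 3 units, 5 frame window (J = 8, N = 4).
    h2 = tdnn_layer(h1, params["w2"], params["b2"], window=5)
    # Output: integrate each hidden layer 2 unit over time with fixed,
    # equal weights (here a plain average, which is an assumption).
    scores = [sum(frame[k] for frame in h2) / len(h2) for k in range(3)]
    return h1, h2, scores


rng = random.Random(0)
params = {
    "w1": random_weights(rng, units=8, window=3, inputs=16),
    "b1": [0.0] * 8,
    "w2": random_weights(rng, units=3, window=5, inputs=8),
    "b2": [0.0] * 3,
}
# Stand-in for a normalized input token: 15 frames of 16 coefficients.
spectrogram = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(15)]
h1, h2, scores = tdnn_forward(spectrogram, params)
print(len(h1), len(h2), len(scores))
```

With a 15 frame input, the first hidden layer yields 13 frames and the second 9 frames, matching the 9 frames over which the output is integrated. Because the same weights are applied at every time shift, shifting the input by one frame simply shifts the hidden-layer firings by one frame, which is the translation property the text relies on.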

nition. By use of two hidden layers in addition to an input and output
layer it is capable of representing complex non-linear decision surfaces.
Three important properties of the TDNNs have been observed. First,
our TDNN was able to invent without human interference meaningful
linguistic abstractions in time and frequency such as formant tracking
and segmentation. Second, we have demonstrated that it has learned
to form alternate representations linking different acoustic events with
the same higher level concept. In this fashion it can implement trading
relations between lower level acoustic events leading to robust recogni-
tion performance despite considerable variability in the input speech.
Third, we have seen that the network is translation-invariant and does
not rely on precise alignment or segmentation of the input. We have
compared the TDNN's performance with the best of our HMMs on a
speaker-dependent phoneme recognition task. The TDNN achieved a
recognition rate of 98.5% compared to 93.7% for the HMM, i.e., a fourfold
reduction in error.
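
As a quick cross-check of these figures, the short calculation below starts from the average error rates reported in Section 4.2 (6.3% for the HMM, 1.5% for the TDNN) and derives the corresponding recognition rates and the error reduction factor. The variable names are chosen for this example only.

```python
# Average error rates over the three speakers, as reported in Section 4.2.
hmm_error_pct = 6.3
tdnn_error_pct = 1.5

# Recognition rate is the complement of the error rate.
hmm_rate_pct = 100.0 - hmm_error_pct
tdnn_rate_pct = 100.0 - tdnn_error_pct

# Factor by which the TDNN reduces the HMM's error rate.
error_reduction = hmm_error_pct / tdnn_error_pct

print(f"HMM recognition rate:   {hmm_rate_pct:.1f}%")
print(f"TDNN recognition rate:  {tdnn_rate_pct:.1f}%")
print(f"Error reduction factor: {error_reduction:.1f}")
```

This gives recognition rates of 93.7% and 98.5% and a reduction factor of about 4.2, consistent with the "more than fourfold" reduction stated in the results.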

