A Practical Handbook of Speech Coders
Linear Prediction (LP) is a widely used and successful method that rep-
resents the frequency shaping attributes of the vocal tract in the source-
filter model of Section 2.3. For speech coding, the LP analysis char-
acterizes the shape of the spectrum of a short segment of speech with
a small number of parameters for efficient coding. Linear prediction,
also frequently referred to as Linear Predictive Coding (LPC), predicts
a time-domain speech sample based on a linearly weighted combination
of previous samples. LP analysis can be viewed simply as a method to
remove the redundancy in the short-term correlation of adjacent sam-
ples. However, additional insight can be gained by presenting the LP
formulation in the context of lossless tube modeling of the vocal tract.
This chapter presents a brief overview of the lossless tube model
and methods to estimate the LP parameters. Different, equivalent rep-
resentations of the parameters are discussed along with the transforma-
tions between the parameter sets. Reference [137] discusses the lossless
tube model in great detail.
Sound waves are pressure variations that propagate through air (or
any other medium) by the vibrations of the air particles. Modeling
these waves and their propagation through the vocal tract provides a
framework for characterizing how the vocal tract shapes the frequency
content of the excitation signal.
$$-\frac{\partial p}{\partial x} = \rho \, \frac{\partial (u/A)}{\partial t} \tag{4.1}$$
and
$$-\frac{\partial u}{\partial x} = \frac{1}{\rho c^{2}} \, \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t} \tag{4.2}$$
where:
resulting in two equations with two unknowns that are integrated with
respect to time to yield the following volume velocity and pressure defi-
nitions:
$$u(x, t) = u_1\!\left(t - \frac{x}{c}\right) - u_2\!\left(t + \frac{x}{c}\right) \tag{4.5}$$
and
$$p(x, t) = \frac{\rho c}{A} \left[ u_1\!\left(t - \frac{x}{c}\right) + u_2\!\left(t + \frac{x}{c}\right) \right] \tag{4.6}$$
Further examination of these formulas reveals that u1 is a wave propa-
gating towards the open end of the tube, while u2 propagates toward the
closed end. Also note that both the sound pressure and the volume velocity
can be described by scaled addition/subtraction (superposition) of these
waves.
This simple model of the vocal tract has the same properties as a simple
electrical system. Compare the wave equations of the lossless tube
system with the current i(x, t) and voltage v(x, t) equations of a uniform
lossless transmission line:
$$-\frac{\partial v}{\partial x} = L \, \frac{\partial i}{\partial t} \tag{4.7}$$
and
$$-\frac{\partial i}{\partial x} = C \, \frac{\partial v}{\partial t} \tag{4.8}$$
Equations 4.3 and 4.4 are the same as 4.7 and 4.8 with the variable
substitutions shown in Table 4.1.
The frequency response of a system of this type is well known, and
finding the frequency response of the lossless tube system requires only
the scaling shown in Table 4.1. The system has an infinite number of
poles on the jω axis corresponding to the tube resonant frequencies of
$\frac{c}{4l} \pm \frac{nc}{2l}$, where n = 0, 1, ..., ∞. These resonances
are plotted in Figure 4.2 for a limited frequency range.
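As a rough numerical illustration of the resonance formula, the sketch below lists the first few positive resonant frequencies c/(4l) + nc/(2l). The tube length of 0.175 m and the sound speed of 350 m/s are illustrative assumptions, not values taken from the text, and `tube_resonances` is a hypothetical helper name.

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=4):
    """Positive resonant frequencies (Hz) of a uniform lossless tube.

    Evaluates f_n = c/(4l) + n*c/(2l) for n = 0, 1, ..., count - 1.
    The default speed of sound is an assumed, illustrative value.
    """
    c, l = speed_of_sound, length_m
    return [c / (4.0 * l) + n * c / (2.0 * l) for n in range(count)]


# Assumed tube length of 0.175 m: resonances near 500, 1500, 2500, 3500 Hz.
resonances = tube_resonances(0.175)
```

With these assumed values the resonances are evenly spaced by c/(2l), which is the regular pattern that a single uniform tube produces.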
FIGURE 4.2
Frequency response of a single lossless tube system.
FIGURE 4.3
Multiple concatenated tube model.
reflection coefficients signify how much energy is reflected and how much
is passed. These reflections cause spectral shaping of the excitation. This
spectral shaping acts as a digital filter with the order of the system equal
to the number of tube boundaries.
The digital filter can be realized with a lattice structure, where the
reflection coefficients are used as weights in the structure. Figure 4.4
displays the lattice filter structure. The ki is the reflection coefficient of
the ith stage of the filter. The flow of the signals suggests the forward
and backward wave propagation as mentioned previously. The input
is the excitation, and the output is the filtered excitation, that is, the
output speech. There are p stages corresponding to p tube sections. The
time delay for each stage in the concatenated tube model is ∆x/c where
c is the speed of sound.
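The lattice structure described above can be sketched in a few lines. This is a minimal illustration, not the handbook's implementation: the function name `lattice_synthesis` is hypothetical, and the sign convention (forward path adds k_i times the delayed backward signal) is an assumption chosen to agree with A(z) = 1 − Σ a_k z^(−k) in Equation 4.9; it may differ from the labeling in Figure 4.4.

```python
def lattice_synthesis(excitation, k):
    """Filter an excitation through a p-stage all-pole lattice.

    k[i - 1] is the reflection coefficient of stage i, i = 1..p.
    Sign convention is an assumption (see the accompanying text).
    """
    p = len(k)
    b_prev = [0.0] * p          # b_prev[i] holds the backward signal b_i(n-1)
    output = []
    for x in excitation:
        f = x                   # forward signal entering the last stage, f_p(n)
        b_now = [0.0] * p
        for i in range(p, 0, -1):
            # Forward path: f_{i-1}(n) = f_i(n) + k_i * b_{i-1}(n-1)
            f = f + k[i - 1] * b_prev[i - 1]
            # Backward path: b_i(n) = b_{i-1}(n-1) - k_i * f_{i-1}(n)
            if i < p:
                b_now[i] = b_prev[i - 1] - k[i - 1] * f
        if p > 0:
            b_now[0] = f        # b_0(n) = f_0(n), the output sample
        b_prev = b_now
        output.append(f)
    return output


# Impulse response of a two-stage lattice with illustrative coefficients.
impulse = [1.0, 0.0, 0.0, 0.0]
response = lattice_synthesis(impulse, [0.5, -0.3])
```

The forward signal `f` plays the role of the wave travelling toward the output, while `b_prev` stores the one-sample-delayed backward wave of each stage.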
The lattice structure can be rearranged into the direct form of the
standard all-pole filter model of Figure 4.5. In this form, each tap, or
predictor coefficient, of the digital filter delays the signal by a single
time unit and propagates a portion of the sample value. There is a
direct conversion between the reflection coefficients, ki of Figure 4.4, and
predictor coefficients, ai of Figure 4.5 (explained in the next section),
and they represent the same information in the LP analysis [137, 105].
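The conversion between the two parameter sets can be sketched as a pair of recursions. The sketch assumes the convention A(z) = 1 − Σ a_k z^(−k) with a_i^(i) = k_i; other texts use opposite signs. The function names are hypothetical.

```python
def reflection_to_predictor(k):
    """Step-up recursion: reflection coefficients -> predictor coefficients.

    Returns [a_1, ..., a_p] for A(z) = 1 - sum_k a_k z^-k (assumed convention).
    """
    a = []
    for i, ki in enumerate(k, start=1):
        # a_j^(i) = a_j^(i-1) - k_i * a_{i-j}^(i-1) for j = 1..i-1, and a_i^(i) = k_i
        a = [a[j] - ki * a[i - 2 - j] for j in range(i - 1)] + [ki]
    return a


def predictor_to_reflection(a):
    """Step-down recursion: predictor coefficients -> reflection coefficients.

    Divides by (1 - k_i^2), so it fails if any |k_i| equals 1.
    """
    a = list(a)
    k = [0.0] * len(a)
    for i in range(len(a), 0, -1):
        ki = a[i - 1]                       # k_i = a_i^(i)
        k[i - 1] = ki
        denom = 1.0 - ki * ki
        # a_j^(i-1) = (a_j^(i) + k_i * a_{i-j}^(i)) / (1 - k_i^2)
        a = [(a[j] + ki * a[i - 2 - j]) / denom for j in range(i - 1)]
    return k


a_example = reflection_to_predictor([0.5, -0.3])      # close to [0.65, -0.3]
k_roundtrip = predictor_to_reflection(a_example)      # close to [0.5, -0.3]
```

Because each recursion inverts the other, either set can be stored or transmitted and the other recovered from it.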
From either the direct-form filter realization or the mathematical
derivation of the lossless tube model [137, 105], linear prediction analysis
is based on the all-pole filter:
$$H(z) = \frac{1}{A(z)} \quad \text{and} \quad A(z) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{4.9}$$
of the filter.
By transforming to the time domain, it can be seen that the system
of Equation 4.9 predicts a speech sample based on a sum of weighted
past samples:
$$s'(n) = \sum_{k=1}^{p} a_k \, s(n - k) \tag{4.10}$$
where s′(n) is the predicted value based on the previous values of the
speech signal s(n).
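The prediction of Equation 4.10 can be sketched directly. The sketch assumes that samples before the start of the segment are zero and that the prediction error is the usual difference e(n) = s(n) − s′(n); both are assumptions, and the function names are hypothetical.

```python
def lp_predict(s, a):
    """Predicted samples s'(n) = sum_{k=1..p} a_k * s(n - k).

    Samples before s[0] are treated as zero (an assumption of this sketch).
    """
    p = len(a)
    predicted = []
    for n in range(len(s)):
        predicted.append(
            sum(a[k - 1] * s[n - k] for k in range(1, p + 1) if n - k >= 0)
        )
    return predicted


def lp_residual(s, a):
    """Prediction error e(n) = s(n) - s'(n) (assumed standard definition)."""
    return [x - xp for x, xp in zip(s, lp_predict(s, a))]


# A decaying exponential is predicted exactly by a first-order predictor.
signal = [1.0, 0.5, 0.25, 0.125]
residual = lp_residual(signal, [0.5])
```

In this example only the first residual sample is nonzero, because no past samples exist to predict it.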
substituting,
$$E = \sum_{n} e^{2}(n) \tag{4.13}$$
$$r(l) = \sum_{m=0}^{N-1-l} s(m) \, s(m + l) \tag{4.14}$$
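Equation 4.14 translates almost line for line into code. The sketch below is a direct, unoptimized evaluation; `autocorrelation` and `max_lag` are hypothetical names, and the segment `s` is assumed to be already windowed.

```python
def autocorrelation(s, max_lag):
    """Autocorrelation r(l) = sum_{m=0}^{N-1-l} s(m) * s(m + l), l = 0..max_lag."""
    N = len(s)
    return [
        sum(s[m] * s[m + l] for m in range(N - l))   # m runs from 0 to N-1-l
        for l in range(max_lag + 1)
    ]


r_example = autocorrelation([1.0, 2.0, 3.0], 2)   # [14.0, 8.0, 3.0]
```

For an order-p analysis, lags 0 through p are needed, so `max_lag` would be set to p.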
$$a_i^{(i)} = k_i \tag{4.17}$$
$$E^{(i)} = (1 - k_i^{2}) \, E^{(i-1)} \tag{4.19}$$
is met, the roots of the predictor polynomial will all lie within the unit
circle in the z-plane, and the all-pole filter will be stable. Filter stability
can be determined by checking this condition of the reflection coeffi-
cients.
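A minimal sketch of the Levinson-Durbin recursion follows, built around Equations 4.17 and 4.19. The steps not shown above (the formula for k_i and the update of the lower-order coefficients) are filled in with their commonly used forms and should be read as assumptions, as should the stability test |k_i| < 1. The function name is hypothetical, and r(0) is assumed to be positive.

```python
def levinson_durbin(r, order):
    """Solve for predictor coefficients from autocorrelation values r[0..order].

    Returns (a, k, E): predictor coefficients a_1..a_p, reflection
    coefficients k_1..k_p, and the final prediction error energy.
    Assumes r[0] > 0 and the convention A(z) = 1 - sum_k a_k z^-k.
    """
    a, k = [], []
    E = r[0]                                   # E^(0) = r(0)
    for i in range(1, order + 1):
        # k_i = [r(i) - sum_{j=1}^{i-1} a_j^(i-1) r(i-j)] / E^(i-1)  (assumed form)
        ki = (r[i] - sum(a[j - 1] * r[i - j] for j in range(1, i))) / E
        # a_i^(i) = k_i (Eq. 4.17); a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1) (assumed form)
        a = [a[j] - ki * a[i - 2 - j] for j in range(i - 1)] + [ki]
        E = (1.0 - ki * ki) * E                # Eq. 4.19
        k.append(ki)
    return a, k, E


# Autocorrelation of a first-order process with coefficient 0.5 (illustrative).
a_ld, k_ld, E_ld = levinson_durbin([1.0, 0.5, 0.25], 2)
is_stable = all(abs(ki) < 1.0 for ki in k_ld)   # assumed stability condition
```

The reflection coefficients fall out of the recursion as a by-product, so the stability check costs nothing extra.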
$$c(i, k) = \sum_{m=0}^{N-1} s(m - i) \, s(m - k) \tag{4.21}$$
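Equation 4.21 reaches back to samples before m = 0, so an implementation needs p samples of history. The sketch below assumes a hypothetical layout in which the history is stored at the front of the array; that layout and the function name are assumptions made for illustration.

```python
def covariance(s_with_history, p, i, k):
    """c(i, k) = sum_{m=0}^{N-1} s(m - i) * s(m - k).

    Assumed layout: s_with_history[p + m] holds s(m) for m = -p, ..., N - 1,
    so the first p entries are the history preceding the segment.
    """
    N = len(s_with_history) - p
    return sum(
        s_with_history[p + m - i] * s_with_history[p + m - k] for m in range(N)
    )


# One history sample s(-1) = 1, then the segment s(0..2) = 2, 3, 4.
data = [1.0, 2.0, 3.0, 4.0]
c00 = covariance(data, 1, 0, 0)   # 29.0
c11 = covariance(data, 1, 1, 1)   # 14.0
c01 = covariance(data, 1, 0, 1)   # 20.0
```

Note that c(0, 0) and c(1, 1) differ here: unlike r(l), the value of c(i, k) depends on i and k individually, not only on their difference.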
The log area ratios are computed from the reflection coefficients as:
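$$L_i = \log \frac{1 + k_i}{1 - k_i} \tag{4.22}$$
and the reflection coefficients are recovered from the log area ratios as:
$$k_i = \frac{e^{L_i} - 1}{e^{L_i} + 1} \tag{4.23}$$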
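The log area ratio mapping is easy to check numerically. The sketch below uses the natural logarithm and, for the return trip, the algebraic inverse of Equation 4.22, k = (e^L − 1)/(e^L + 1). The function names are hypothetical, and |k_i| < 1 is assumed so that the logarithm is defined.

```python
import math


def reflection_to_lar(k):
    """Log area ratios L_i = log((1 + k_i) / (1 - k_i)); requires |k_i| < 1."""
    return [math.log((1.0 + ki) / (1.0 - ki)) for ki in k]


def lar_to_reflection(lar):
    """Algebraic inverse of the mapping above: k_i = (e^L - 1) / (e^L + 1)."""
    return [(math.exp(L) - 1.0) / (math.exp(L) + 1.0) for L in lar]


k_in = [0.5, -0.3, 0.9]
lar = reflection_to_lar(k_in)
k_out = lar_to_reflection(lar)      # matches k_in to rounding error
```

The mapping stretches values of k_i near ±1, where the spectrum is most sensitive to small changes in the reflection coefficient.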
where A(z) is the inverse LP filter of Equation 4.9, and p is the order of
the LP analysis.
The p roots, or zeros, of P(z) and Q(z) lie on the unit circle, in com-
plex conjugate pairs (in addition, one root will be at +1, and one at –1).
Their angle in the z-plane represents a frequency, and pairs, or groups
of three, of these frequencies are responsible for the formants in the LP
spectrum. The bandwidth of the formant (how sharp the formant peak
is) is determined by how close together the LSFs are for that formant.
$$A(z) = \frac{1}{2} \left[ P(z) + Q(z) \right] \tag{4.26}$$
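A minimal sketch of how line spectrum frequencies might be computed is given below. It assumes the common definitions P(z) = A(z) + z^(−(p+1)) A(z^(−1)) and Q(z) = A(z) − z^(−(p+1)) A(z^(−1)), which are consistent with Equation 4.26. The root search (a fixed grid on the unit circle followed by bisection) is an illustrative choice, not a method taken from the text, and could miss roots closer together than one grid step. All function names are hypothetical.

```python
import math


def lsf_polynomials(a):
    """Coefficients of P and Q in powers of z^-1, for A(z) = 1 - sum_k a_k z^-k."""
    A = [1.0] + [-ak for ak in a] + [0.0]      # padded to degree p + 1
    rev = A[::-1]                              # coefficients of z^-(p+1) A(1/z)
    P = [x + y for x, y in zip(A, rev)]        # symmetric
    Q = [x - y for x, y in zip(A, rev)]        # antisymmetric
    return P, Q


def _zeros_on_unit_circle(coeffs, trig, grid=2048):
    """Zeros in (0, pi) of g(w) = sum_j c_j * trig(w * (M/2 - j)).

    With trig = cos this is a symmetric polynomial evaluated on the unit
    circle (linear phase removed); with trig = sin, an antisymmetric one.
    """
    M = len(coeffs) - 1

    def g(w):
        return sum(c * trig(w * (M / 2.0 - j)) for j, c in enumerate(coeffs))

    zeros = []
    step = math.pi / grid
    w_prev = 0.5 * step                        # stay strictly inside (0, pi)
    g_prev = g(w_prev)
    for n in range(1, grid):
        w = (n + 0.5) * step
        g_w = g(w)
        if g_prev * g_w < 0.0:                 # sign change brackets a zero
            lo, hi = w_prev, w
            for _ in range(60):                # bisection refinement
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            zeros.append(0.5 * (lo + hi))
        w_prev, g_prev = w, g_w
    return zeros


def line_spectrum_frequencies(a):
    """LSFs in radians, sorted; the trivial roots at z = +1 and -1 are skipped."""
    P, Q = lsf_polynomials(a)
    return sorted(
        _zeros_on_unit_circle(P, math.cos) + _zeros_on_unit_circle(Q, math.sin)
    )


lsf_example = line_spectrum_frequencies([0.65, -0.3])
```

For this second-order example the sketch returns two frequencies, one from P(z) and one from Q(z), in increasing order.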
$$H_{LP}(\omega) = \frac{\sigma}{|A(\omega)|} \tag{4.27}$$
where σ is the square root of the energy of the segment, and A(ω) is
A(z) of Equation 4.9 evaluated on the unit circle, z = e^{jω}. The |A(ω)|
is computed as the magnitude of the DFT of the sequence
a(n) = {1, −a1, −a2, ..., −ap−1, −ap}.
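The LP spectrum of Equation 4.27 can be sketched by evaluating A(ω) on a frequency grid, which is equivalent to a zero-padded DFT of the sequence {1, −a1, ..., −ap}. The grid size, the function name, and the example values are illustrative assumptions; the sketch also assumes A(ω) has no zero exactly on the grid.

```python
import cmath
import math


def lp_spectrum(a, sigma, num_points=256):
    """Magnitude sigma / |A(w)| at w = pi * n / num_points, n = 0..num_points-1."""
    seq = [1.0] + [-ak for ak in a]            # coefficients of A(z)
    spectrum = []
    for n in range(num_points):
        w = math.pi * n / num_points
        # A(w) = sum_m seq[m] * exp(-j * w * m)
        A_w = sum(c * cmath.exp(-1j * w * m) for m, c in enumerate(seq))
        spectrum.append(sigma / abs(A_w))
    return spectrum


# First-order example: the gain at w = 0 is 1 / (1 - 0.5) = 2.
H = lp_spectrum([0.5], sigma=1.0)
```

Index n maps to the frequency n · fs / (2 · num_points) in Hz for a sampling rate fs, so peaks in `H` can be read off as formant candidates.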
The plots indicate how the LP analysis models the general shape of the
spectrum, but does not model the fine structure. In the voiced example,
the LP representation does not model the pitch harmonics. The formants
are evident in the LP spectrum of Figure 4.6 at approximately 300,
1200, 2400, and 3200 Hz. In Figure 4.7, the LP spectrum models the
overall vocal tract shape but does not model the random, noise-like fine
structure displayed in the unvoiced DFT spectrum. A prominent formant
is evident at about 2900 Hz.