Speech Generation
Speech is generated by pumping air from the lungs through the vocal tract, which consists of the throat, nose, mouth, palate, tongue, teeth and lips.
Human Vocal Tract. For the creation of nasal sounds, the nasal cavity can be coupled to the rest of the vocal tract
by the soft palate.
In voiced sounds, the tensed vocal cords vibrate and the airflow gets modulated. The oscillation frequency of the vocal cords is called the 'pitch'. The vocal tract colours the spectrum of the pulsating airflow in a sound-typical manner. In voiceless sounds, the vocal cords are relaxed and white-noise-like turbulence forms at constrictions in the vocal tract. The remaining vocal tract colours the turbulent airflow more or less strongly, depending on the position of the constriction. Another type of voiceless sound is created by an explosion-like opening of the vocal tract.
Speech Generation Model. The vocal-tract filter is time-variant. A simplified model integrates the glottis filter into
the vocal tract filter.
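To make this source-filter model concrete, here is a minimal sketch in Python (NumPy/SciPy). The pitch, the pole locations and all other values are purely illustrative assumptions, not measured vocal-tract data: a periodic pulse train (voiced) or white noise (voiceless) is passed through a fixed all-pole filter whose resonances play the role of the vocal tract.

import numpy as np
from scipy.signal import lfilter

fs = 8000                              # sampling frequency in Hz
pitch_hz = 120                         # assumed pitch of the voiced excitation

# Source: periodic glottal pulse train (voiced) or white noise (voiceless)
period = fs // pitch_hz
voiced_excitation = np.zeros(fs)       # one second of excitation
voiced_excitation[::period] = 1.0
voiceless_excitation = np.random.randn(fs)

# Filter: an all-pole 'vocal tract' with two resonances (illustrative values);
# the pole angles determine the formant frequencies.
poles = 0.97 * np.exp(1j * 2 * np.pi * np.array([700, 1200]) / fs)
a = np.poly(np.concatenate((poles, poles.conj()))).real

voiced_sound = lfilter([1.0], a, voiced_excitation)
voiceless_sound = lfilter([1.0], a, voiceless_excitation)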
Digitizing Speech
The analogue speech signal is sampled and quantized. The sampling frequency is usually 8 kHz, and typical analogue-to-digital converters have a linear resolution of 16 bits. With this so-called 'linear pulse code modulation' (linear PCM) we thus get a data stream of 8 kHz × 16 bit = 128 kbit/s. The
digitized speech is then compressed in order to reduce the bit rate.
Logarithmic PCM
Simple speech codecs such as ITU-T G.711 encode each speech sample with only 8 instead of 16 bits by applying a quasi-logarithmic quantization curve (A-law in ETSI countries and mu-law in the US). The idea behind logarithmic companding is to provide an approximately constant signal-to-noise ratio (SNR) independent of the signal level. This results in a data rate of 64 kbit/s. G.711 is the
standard codec used in the Public Switched Telephone Network (PSTN) and Integrated Services
Digital Network (ISDN) throughout the world.
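The idea of logarithmic companding can be sketched with the continuous mu-law characteristic (a minimal Python example; note that G.711 itself uses a segmented, piecewise-linear approximation of this curve, and the values below are only illustrative):

import numpy as np

def mu_law_compress(x, mu=255.0):
    # Continuous mu-law characteristic for x normalized to [-1, 1]
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=255.0):
    # Inverse characteristic used at the receiver
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.array([-0.5, -0.01, 0.0, 0.003, 0.25, 0.9])    # normalized 16-bit samples
y = np.round(mu_law_compress(x) * 127) / 127          # quantized to 8 bits
x_hat = mu_law_expand(y)                              # reconstructed samples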
However, 64 kbit/s is far too high for mobile telephone networks and for other applications such as mass storage of speech signals, so further compression is necessary. In contrast to general-purpose compression algorithms such as ZIP, speech compression algorithms are lossy; they exploit well-known statistical properties of speech and thus achieve much higher compression ratios.
Formants
One major property of speech is its correlation, i.e. successive samples of a speech signal are
similar. This is depicted in the following figure:
The short-term correlation of successive speech samples has consequences for the short-term
spectral envelopes. These spectral envelopes have a few local maxima, the so-called 'formants', which correspond to resonance frequencies of the human vocal tract. Speech consists of a
succession of sounds, the so-called phonemes. While speaking, humans continuously change the
setting of their vocal tract in order to produce different resonance frequencies (formants) and
therefore different sounds.
German vowel /a/: spectral estimate of the original signal (green), spectral estimate of the predicted signal with a 10th-order short-term predictor (red), and frequency response of a 10th-order speech model filter based on the predictor coefficients (blue). Four formants can easily be identified.
This (short-term) correlation can be used to estimate the current speech sample from past
samples. This estimation is called 'prediction'. Because the prediction is done by a linear
combination of (past) speech samples, it is called 'linear prediction'. The difference between the
original and the estimated signal is called 'prediction error signal'. Ideally, all correlation is
removed, i.e. the error signal is white noise. Only the error signal is conveyed to the receiver: it has less redundancy, so each bit carries more information. In other words, with fewer bits we can transport the same amount of information, i.e. obtain the same speech quality.
The calculation of the prediction error signal corresponds to a linear filtering of the original
speech signal: the speech signal is the filter input, and the error signal is the output. The main component of this filter is the (short-term) predictor. The goal of the filter is to 'whiten' the
speech signal, i.e. to filter out the formants. That is why this filter is also called an 'inverse
formant filter'.
LPC Filter.
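As a rough illustration of this inverse formant filtering, the following Python sketch computes a 10th-order predictor for one frame with the standard autocorrelation method and then filters the frame with A(z) to obtain the prediction error signal. The frame length, the random stand-in signal and the helper names are illustrative assumptions only:

import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_coefficients(frame, order=10):
    # Autocorrelation method: solve the normal equations R a = r
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return solve_toeplitz((r[:order], r[:order]), r[1:order + 1])

def prediction_error(frame, a):
    # Inverse formant filter A(z) = 1 - sum_k a_k z^-k 'whitens' the frame
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)

s = np.random.randn(160)              # stand-in for a 20 ms frame at 8 kHz
a = lpc_coefficients(s, order=10)     # short-term predictor coefficients
e = prediction_error(s, a)            # error signal conveyed to the receiver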
While speaking, the formants continuously change. The short-term correlation thus also changes
and the predictor must be adapted to these changes. Thus, the predictor and the prediction
error filter are adaptive filters whose parameters must be continuously estimated from the speech
signal.
For those interested in details, here are the mathematics for the calculation of the filter
coefficients:
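The following is a sketch of the standard autocorrelation (Yule-Walker) approach, which codecs of this kind typically use. The predictor estimates the current sample from the p previous samples,

ŝ(n) = a_1·s(n-1) + a_2·s(n-2) + ... + a_p·s(n-p),

and the prediction error is e(n) = s(n) - ŝ(n). The coefficients a_k are chosen such that the mean squared error E[e²(n)] over the frame is minimal. Setting the partial derivatives with respect to the a_k to zero yields the normal equations

Σ_{k=1..p} a_k · r(|i-k|) = r(i),   i = 1, ..., p,

where r(i) is the autocorrelation of the speech frame. Because the matrix of these equations is Toeplitz, they can be solved efficiently with the Levinson-Durbin recursion.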
Pitch
Voiced sounds such as vowels have a periodic structure, i.e. their waveform repeats itself after a few milliseconds, the so-called pitch period T_P. Its reciprocal f_P = 1/T_P is called the pitch frequency. So in voiced sounds there is also correlation between distant samples.
This long-term correlation is exploited for bit-rate reduction with a so-called long-term predictor
(also called pitch predictor).
The pitch frequency is not constant either. In general, female speakers have a higher pitch than male speakers, and the pitch frequency also varies in a speaker-typical manner. In addition, there are voiceless sounds such as many consonants which have no periodic structure at all, and mixed sounds with voiceless and voiced components also exist. Thus the pitch predictor must likewise be implemented as an adaptive filter, and the pitch period and the pitch gain must be continuously estimated from the speech signal.
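A common way to obtain the pitch period is to search the autocorrelation of a frame for its maximum within a plausible lag range; the one-tap pitch predictor below and all numerical values are a simplified, illustrative sketch (real codecs use more refined searches and fractional delays):

import numpy as np

def estimate_pitch(frame, fs=8000, f_min=60.0, f_max=400.0):
    # Search the autocorrelation maximum within the plausible pitch-lag range
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    return lag_min + int(np.argmax(r[lag_min:lag_max + 1]))   # T_P in samples

def long_term_prediction_error(frame, lag):
    # One-tap pitch predictor: e(n) = s(n) - g * s(n - T_P)
    past = np.concatenate((np.zeros(lag), frame[:-lag]))
    g = np.dot(frame, past) / (np.dot(past, past) + 1e-12)    # optimal pitch gain
    return frame - g * past, g

fs = 8000
t = np.arange(int(0.03 * fs)) / fs          # 30 ms frame
s = np.sin(2 * np.pi * 125 * t)             # stand-in for a voiced frame
lag = estimate_pitch(s, fs)                 # approximately fs / 125 = 64 samples
e, g = long_term_prediction_error(s, lag)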
Spectrogram
A useful representation of the speech signal is the so-called spectrogram, which shows the
signal's power distribution with respect to time and frequency. The idea is to calculate the (short-
term) power spectral densities (this gives the power and frequency information) of successive
(this gives the time information) fragments of the signal. Usually, the time is plotted on the x-axis of a spectrogram and the frequency on the y-axis. The power is coded as colour, for example red meaning high power and blue low power. The spectrogram in the following figure shows all of the speech characteristics discussed so far. The upper plot is the waveform, and the lower one the corresponding spectrogram. We will use exactly this speech signal later as input for our example codecs.
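A spectrogram of this kind can be computed, for example, with SciPy's spectrogram function; the signal below is a random stand-in and the frame and overlap sizes are typical, illustrative choices:

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import spectrogram

fs = 8000
s = np.random.randn(2 * fs)            # stand-in for a real speech signal at 8 kHz

# Short-term power spectral densities of successive, overlapping frames
f, t, Sxx = spectrogram(s, fs=fs, nperseg=256, noverlap=192)

plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12))   # power in dB, coded as colour
plt.xlabel('time [s]')
plt.ylabel('frequency [Hz]')
plt.show()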
CELP
Modern algorithms which are used in digital mobile networks such as GSM or UMTS store a limited set of potential prediction error signals (the stochastic codebook), and the index of the best-fitting signal is transferred to the receiver. The decoder has the same codebook available and can retrieve the best-fitting error signal with this index. In contrast to a full-size error signal, such an index needs only a few bits. Which of the potential error signals is the best one? To answer this question, each prediction error signal is vocal-tract filtered already in the coder, and the speech signal synthesized in this way is compared to the original speech signal by applying an error measure, e.g. the mean square error. The prediction error signal which minimizes the error measure is chosen. This procedure is called 'analysis-by-synthesis'. Mathematically, a vector quantization of the original speech segment is performed.
CELP Principle: Analysis-by-Synthesis. The best codebook vector is determined and afterwards the corresponding
best gain value is calculated.
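The search can be sketched as follows (a heavily simplified Python illustration: each stochastic codebook vector is passed through the synthesis filter 1/A(z), the optimal gain is fitted, and the vector with the smallest mean square error wins; real codecs additionally apply perceptual weighting, which is omitted here, and all numerical values are illustrative assumptions):

import numpy as np
from scipy.signal import lfilter

def search_codebook(target, codebook, A):
    # Analysis-by-synthesis: synthesize each candidate through 1/A(z),
    # fit the optimal gain, keep the index with the smallest MSE.
    best_index, best_gain, best_mse = -1, 0.0, np.inf
    for i, c in enumerate(codebook):
        y = lfilter([1.0], A, c)                          # synthesized speech
        g = np.dot(target, y) / (np.dot(y, y) + 1e-12)    # best gain for this vector
        mse = np.mean((target - g * y) ** 2)
        if mse < best_mse:
            best_index, best_gain, best_mse = i, g, mse
    return best_index, best_gain              # only index and gain are transmitted

fs, frame_len, codebook_size = 8000, 40, 512
codebook = np.random.randn(codebook_size, frame_len)      # stochastic codebook
poles = 0.95 * np.exp(1j * 2 * np.pi * np.array([500, 1500]) / fs)
A = np.poly(np.concatenate((poles, poles.conj()))).real   # illustrative A(z)
target = np.random.randn(frame_len)                       # stand-in speech subframe
index, gain = search_codebook(target, codebook, A)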
However, the speech quality of such a basic CELP codec is inferior: although intelligible, it sounds artificial and lacks naturalness. This is because voiced (periodic) speech frames are not reproduced well. This can be improved by performing an APC-like pitch analysis first. The calculated LTP delay and LTP gain values are then used to induce a pitch structure into the stochastic codebook vectors prior to the analysis-by-synthesis. This idea is realized in the next example:
Closed-Loop CELP
The LTP analysis can also be included in the analysis-by-synthesis loop, resulting in an even better reproduction of voiced frames, especially for small LTP delay values, which are typical for female speakers. Such a structure is called 'closed-loop'. The next example shows a basic closed-loop CELP codec:
Closed-Loop CELP Principle: LTP included in Analysis-by-Synthesis.
First, the optimal adaptive excitation is determined, and then, in a second step, the optimal stochastic excitation that minimizes the remaining error. Again, we get quite good speech quality with parameter settings resulting in a bit rate of 5.4 kbit/s. In MATLAB, change the parameter settings again and observe how the speech quality improves or deteriorates.
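The two-step search described above can be sketched as follows (again a greatly simplified, illustrative Python sketch: the adaptive codebook is built from delayed copies of the past excitation, and the stochastic codebook then minimizes what remains; fractional delays, perceptual weighting and the excitation update of real codecs are omitted, and all names and values are assumptions):

import numpy as np
from scipy.signal import lfilter

def synth(x, A):
    return lfilter([1.0], A, x)                 # synthesis filter 1/A(z)

def best_gain(target, y):
    return np.dot(target, y) / (np.dot(y, y) + 1e-12)

def closed_loop_search(target, past_excitation, codebook, A, lags=range(20, 140)):
    n = len(target)
    # Step 1: adaptive codebook = delayed copies of the past excitation (LTP)
    best = (lags[0], 0.0, np.inf, np.zeros(n))
    for lag in lags:
        c = past_excitation[-lag:][:n]
        c = np.pad(c, (0, n - len(c)))          # zero-fill short lags (simplified)
        y = synth(c, A)
        g = best_gain(target, y)
        err = np.mean((target - g * y) ** 2)
        if err < best[2]:
            best = (lag, g, err, c)
    lag, g_ltp, _, c_adaptive = best
    # Step 2: stochastic codebook minimizes the error remaining after the LTP step
    residual = target - g_ltp * synth(c_adaptive, A)
    idx, g_fix, best_err = -1, 0.0, np.inf
    for i, c in enumerate(codebook):
        y = synth(c, A)
        g = best_gain(residual, y)
        err = np.mean((residual - g * y) ** 2)
        if err < best_err:
            idx, g_fix, best_err = i, g, err
    return lag, g_ltp, idx, g_fix               # parameters sent to the decoder

fs, n = 8000, 40
poles = 0.95 * np.exp(1j * 2 * np.pi * np.array([500, 1500]) / fs)
A = np.poly(np.concatenate((poles, poles.conj()))).real
target = np.random.randn(n)                     # stand-in speech subframe
past_excitation = np.random.randn(200)          # excitation of previous subframes
codebook = np.random.randn(256, n)
lag, g_ltp, idx, g_fix = closed_loop_search(target, past_excitation, codebook, A)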
Now you know how a closed-loop CELP, which is the basis of state-of-the-art speech codecs such as ITU-T G.723.1, ITU-T G.729, the ETSI GSM Enhanced Full Rate codec (EFR), etc., works! All these advanced codecs use this concept; they differ more or less only in the implementation and in some tricks that lead to somewhat better speech quality.
Literature