Speech Enhancement
• Speech is used for communication
• When the speaker and listener are near each other in a quiet environment,
communication is easy and accurate
• However, at a distance or in a noisy background, the listener’s ability to
understand suffers
• Speech can be transmitted electrically using conversion media such as
microphones, loudspeakers and earphones, and transmission media such as telephone
and radio
• These introduce distortions yielding noisy speech
• Such degradation lowers the intelligibility and/or quality of the speech signal
• Speech Enhancement (SE) is the processing of a speech signal that has been
subjected to degradations (additive noise, interfering talkers, band limiting,
etc.) to increase its intelligibility (likelihood of being correctly understood)
and/or its quality (naturalness and freedom from distortion, as well as ease of
listening)
• SE is useful in
1. Aircraft
2. Mobile
3. Military
4. Commercial communication
5. Aids for the handicapped
• Applications of speech enhancement
1. Speech over noisy transmission channels (cellular telephony and pagers)
2. Speech produced in noisy environments (vehicles and telephone booths)
• Objectives of SE
1. Reduction of noise level
2. Increase intelligibility
3. Reduction of auditory fatigue etc.
• For communication systems, two general objectives depend on the
nature of the noise and the Signal to Noise Ratio (SNR)
• With medium SNR (> 5 dB), reducing the noise level can produce a subjectively
natural speech signal at a receiver (e.g., over a telephone line) or can yield reliable
transmission (e.g., in a tandem vocoder application)
• For low SNR, the objective is to decrease the noise level while retaining or increasing
intelligibility and reducing the fatigue caused by heavy noise (e.g., motor or street
noise)
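Since the objectives above are stated in terms of SNR, a minimal sketch of how SNR in dB might be computed is given here. It assumes the clean signal is available as a reference (true only in simulations), and the names `snr_db`, `clean` and `noisy` are illustrative, not from the source.

```python
import math

def snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise component."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((y - s) ** 2 for s, y in zip(clean, noisy))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)

# Toy example (not speech): a sinusoid plus a small alternating disturbance.
clean = [math.sin(2 * math.pi * 0.05 * n) for n in range(200)]
noisy = [s + 0.1 * (-1) ** n for n, s in enumerate(clean)]
value = snr_db(clean, noisy)  # roughly 17 dB, i.e. above the 5 dB "medium" mark
```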
• Important things to be considered:
1. Need to detect intervals in a noisy signal where speech is absent (to estimate
aspects of noise alone)
2. The difficulty of enhancing weak, unvoiced speech
3. Difficulty of reducing non-stationary interfering noise
4. Frequent need to function in real time
5. Reducing computation
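Point 1 above (finding speech-absent intervals to estimate the noise) can be sketched with a simple frame-energy threshold. This is only an illustration under stated assumptions: the first few frames are assumed to be speech-free, and `frame_len`, `margin` and `n_init` are hypothetical tuning parameters. An energy threshold of this kind is likely to miss weak, unvoiced speech, which is the difficulty named in point 2.

```python
import math

def frame_energies(x, frame_len):
    """Mean energy of each non-overlapping frame."""
    return [sum(v * v for v in x[i:i + frame_len]) / frame_len
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def speech_absent_frames(x, frame_len=80, margin=4.0, n_init=3):
    """Flag frames whose energy is below margin * noise floor.

    The noise floor is the mean energy of the first n_init frames,
    which are ASSUMED to contain noise only.
    """
    energies = frame_energies(x, frame_len)
    noise_floor = sum(energies[:n_init]) / n_init
    return [e < margin * noise_floor for e in energies], noise_floor

# Toy input: low-level disturbance, with a tone burst in the fourth frame.
x = [0.05 * (-1) ** n for n in range(400)]
for n in range(240, 320):
    x[n] += math.sin(2 * math.pi * 0.1 * n)

absent, noise_floor = speech_absent_frames(x)
```

Frames flagged as speech-absent could then be used to update a running estimate of the noise spectrum.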
• SE can be used to improve speech signals
1. For human listening
2. As a pre- or post-processing step in speech coding systems
3. As a pre-processing step in speech recognizers
• The methods depend on the application
• For human listening
• SE should aim for high quality as well as intelligibility
• For coders or recognizers
• Quality is largely irrelevant (the enhanced speech may even sound worse)
• What matters is that the enhanced speech allows more efficient parameter
estimation in a coder or higher accuracy in a recognizer
• For example, pre-emphasizing (balancing relative amplitudes across frequency)
the speech in anticipation of broadband channel noise (which may distort many
frequencies) doesn’t enhance the speech but allows easier noise removal later (via
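A minimal sketch of pre-emphasis as a first-order filter, y[n] = x[n] - alpha * x[n-1], together with its inverse, is shown here. The coefficient `alpha = 0.95` is a commonly used value and is an assumption, as is the pairing with an inverse (de-emphasis) filter at the receiving end; the source does not specify either.

```python
def pre_emphasis(x, alpha=0.95):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1]."""
    y, prev = [], 0.0
    for v in x:
        y.append(v - alpha * prev)
        prev = v
    return y

def de_emphasis(y, alpha=0.95):
    """Inverse filter: x[n] = y[n] + alpha * x[n-1]."""
    x, prev = [], 0.0
    for v in y:
        prev = v + alpha * prev
        x.append(prev)
    return x

x = [0.0, 1.0, 0.5, -0.25, 0.75]
y = pre_emphasis(x)
restored = de_emphasis(y)  # matches x up to rounding error
```

Any broadband noise added between the two steps is attenuated at high frequencies by the de-emphasis filter, while the speech itself is restored.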
• Performance evaluation: for synthesizers or low-rate coders, subjective tests are
required because quality and intelligibility cannot be measured mathematically
• Measuring the effect of SE on recognition of noisy speech is much easier
(objective tests or listening tests, iterative SE)
• Most SE techniques improve speech quality without increasing intelligibility,
some reduce intelligibility
• SNR is an easily computed objective measure which reflects quality, not
intelligibility
• Aspects of quality are important in reproducing music
• However, when speech is subject to distortions, it is usually more important to
preserve its intelligibility than its quality
Nature of Interfering Sounds
Interfering music has properties similar to speech, allowing the possibility of
its suppression via similar methods (except that some musical chords have more
than one F0, thus spreading energy to more frequencies than speech does)
SE Techniques
There are four classes of SE techniques, each with its own advantages and
disadvantages
1. Subtraction of interfering sounds
Either the important speech-related components of the distorted input signal are
estimated (retaining them and eliminating other components), or the corrupted portions
of the signal are estimated (and subtracted from the input), in the time or frequency domain
2. Filtering out such sounds
Distorted components are suppressed using filtering methods in the time or frequency
domain
3. Suppression of non-harmonic frequencies
It works only for voiced speech and requires estimation of 𝐹0 and suppression of
spectral energy between desired harmonics (assumption is that enhancement of noisy
speech is feasible on strong periodic sounds and unvoiced sounds are irretrievably lost
or difficult to enhance)
4. Resynthesis using a vocoder
Adopts a speech production model (e.g., low rate coding) and reconstructs a clean speech
signal based on the model using parameters estimated from the noisy speech
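As an illustration of the first class (subtraction in the frequency domain), here is a minimal sketch of magnitude spectral subtraction. It is not taken from the source and makes several simplifying assumptions: non-overlapping rectangular frames, a separate noise-only recording (`noise_only`) standing in for speech-free intervals, a steady tone standing in for speech, and a hypothetical spectral `floor` parameter to avoid negative magnitudes.

```python
import numpy as np

def spectral_subtract(noisy, noise_mag, frame_len=256, floor=0.01):
    """Subtract an average noise magnitude spectrum frame by frame,
    keeping the phase of the noisy signal."""
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, frame_len):
        spec = np.fft.rfft(noisy[start:start + frame_len])
        mag = np.abs(spec)
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        out[start:start + frame_len] = np.fft.irfft(
            clean_mag * np.exp(1j * np.angle(spec)), n=frame_len)
    return out

def snr_db(clean, estimate):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((estimate - clean) ** 2))

rng = np.random.default_rng(0)
n = np.arange(256 * 20)
clean = np.sin(2 * np.pi * 16 / 256 * n)            # stand-in for speech
noise_only = 0.3 * rng.standard_normal(len(n))      # stand-in for a speech-free interval
noisy = clean + 0.3 * rng.standard_normal(len(n))

# Average noise magnitude spectrum over the noise-only frames.
noise_mag = np.mean([np.abs(np.fft.rfft(noise_only[i:i + 256]))
                     for i in range(0, len(noise_only), 256)], axis=0)

enhanced = spectral_subtract(noisy, noise_mag)
snr_in, snr_out = snr_db(clean, noisy), snr_db(clean, enhanced)
```

In practice, overlapping windowed frames are normally used, and the residual left by this kind of subtraction can itself be audible, so the sketch should be read as a starting point only.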
Enhancement by Resynthesis
• This improves speech by parametric estimation where speech synthesizers
extract parameters using the vocal tract model or previously analyzed speech
• Most synthesizers employ separate representations for vocal tract shape and
excitation information, coding the former with about 10 spectral parameters
(formants and bandwidths) and coding the latter with estimates of intensity
and periodicity
• Synthesis suffers from the same mechanical quality found in low-rate
speech coding and from degraded parameter estimation (due to noise), but can
be free of direct noise interference if the parameters model the original speech
• Low-rate speech coding methods (e.g., LPC) may be applied in some SE cases
• In a low dimensional speech production model, speech is assumed to come
from passing an excitation signal with a flat spectrum through an all-pole
vocal tract filter model and to have a z-transform response
$$S(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}} \qquad (2)$$
where the gain factor $G$ and the LPC coefficients $a_k$ are directly related to the pole
locations (a function of formant frequencies and bandwidths), with $p \approx 10$ to $12$
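One standard way to estimate the coefficients $a_k$ and gain $G$ of this all-pole model is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below is an illustration rather than the source's own procedure; it uses order p = 2 (instead of 10 to 12) on a synthetic impulse response so that the result can be checked by hand, and the function names are hypothetical.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve for predictor coefficients a_1..a_p in
    s[n] ~ sum_k a_k s[n-k]; also return the prediction error energy."""
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err            # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

# Impulse response of S(z) = 1 / (1 - 1.3 z^-1 + 0.6 z^-2), i.e. a = [1.3, -0.6], G = 1.
h = []
for n in range(400):
    v = 1.0 if n == 0 else 0.0
    if n >= 1:
        v += 1.3 * h[n - 1]
    if n >= 2:
        v -= 0.6 * h[n - 2]
    h.append(v)

a, err = levinson_durbin(autocorr(h, 2), 2)
gain = err ** 0.5   # estimate of G
```

On this noise-free input the recursion recovers the true coefficients; on noisy input the same estimate is biased.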
• LPC works best when the coefficients are estimated from noise free speech but
degrades badly on noisy speech
• Non-causal Wiener filtering based on the LPC all-pole model attempts to solve for
the ML estimate of speech in additive noise
• It tends to output speech with overly narrow bandwidths and large frame to
frame fluctuations
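For reference, the frequency-domain form of a Wiener filter built on the all-pole model can be written as below. This is a sketch under assumptions not stated in the source: the noise is additive and uncorrelated with the speech, and $P_n(\omega)$ (a symbol introduced here) denotes its power spectrum.

```latex
H(\omega) = \frac{P_s(\omega)}{P_s(\omega) + P_n(\omega)},
\qquad
P_s(\omega) = \frac{G^2}{\left| 1 - \sum_{k=1}^{p} a_k e^{-j\omega k} \right|^2}
```

Because $G$ and $a_k$ must themselves be estimated from the noisy input, such schemes are typically run iteratively: filter, re-estimate the model from the filtered output, and repeat.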
• Spectral constraints based on redundancies in the speech production process
and on aspects of perception can overcome this flaw and raise speech quality
as well as accelerate convergence in iterative Wiener filtering
• The LPC synthesis filter is usually excited by either a periodic source or a
noise source depending on whether the analyzed speech is estimated to be
voiced or not
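The voiced/unvoiced excitation choice can be sketched as follows. This is a minimal illustration, assuming an impulse train for the periodic source and Gaussian white noise for the noise source; `pitch_period`, `seed` and the two-coefficient filter are hypothetical values chosen for the example, not parameters from the source.

```python
import random

def excitation(n_samples, voiced, pitch_period=80, seed=0):
    """Impulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]

def lpc_synthesize(exc, a, gain=1.0):
    """All-pole synthesis: s[n] = gain * e[n] + sum_k a[k-1] * s[n-k]."""
    s = []
    for n, e in enumerate(exc):
        v = gain * e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                v += ak * s[n - k]
        s.append(v)
    return s

a = [1.3, -0.6]   # toy all-pole filter (stable, poles inside the unit circle)
voiced_out = lpc_synthesize(excitation(160, voiced=True), a)
unvoiced_out = lpc_synthesize(excitation(160, voiced=False), a)
```

The impulse train fixes the phase of every harmonic, so this simple excitation carries none of the phase information of the original speech.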
• Speech quality improves if some aspects of phase are preserved in voiced
excitation rather than being discarded (as often done in LPC)
• Phase can be retained by direct modelling of both the amplitude and phase of
the original speech harmonics or by modelling a version of the inverse-filtered
speech
• Direct harmonic modelling with LPC resynthesis of speech degraded by
an interfering speaker succeeds in raising quality but not intelligibility
• Resynthesis is not the best choice among SE methods due to the difficulty of
estimating model parameters from distorted speech and due to inherent flaws
in most speech models
• However, it has application in improving the speech of handicapped speakers
• SE methods can reduce impulse noise (often found in telephony and some
analog storage) as well as stationary noise without degrading speech quality