
08/02/2025

SPEECH PROCESSING
Speech Enhancement

Introduction
• Speech is used for communication
• When the speaker and listener are near each other in a quiet environment,
communication is easy and accurate
• However, at a distance or against a noisy background, the listener’s ability
to understand suffers
• Speech can be transmitted electrically using conversion media such as
microphones, loudspeakers and earphones, and transmission media such as the
telephone and radio
• These introduce distortions yielding noisy speech
• Such degradation lowers the intelligibility and/or quality of speech signal
• Speech Enhancement (SE) is a way of processing a speech signal which has been
subjected to degradations (additive noise, interfering talkers, band limiting,
etc.) to increase its intelligibility (likelihood of being correctly understood)
and/or its quality (naturalness and freedom from distortion, as well as ease of
listening)

Introduction
• SE is useful in
1. Aircraft
2. Mobile
3. Military
4. Commercial communication
5. Aids for the handicapped
• Applications of speech enhancement
1. Speech over noisy transmission channels (cellular telephony and pagers)
2. Speech produced in noisy environments (vehicles and telephone booths)


Introduction
• Objectives of SE
1. Reduction of noise level
2. Increase of intelligibility
3. Reduction of auditory fatigue
• For communication systems, these general objectives depend on the nature of
the noise and the Signal to Noise Ratio (SNR)
• With medium SNR (> 5 dB), reducing the noise level can produce a subjectively
natural speech signal at a receiver (telephone line) or can obtain reliable
transmission (tandem vocoder application)
• For low SNR, objective is to decrease noise level while retaining or increasing
intelligibility and reducing fatigue caused by heavy noise (motor or street
noise)

Introduction
• Important things to be considered:
1. Need to detect intervals in a noisy signal where speech is absent (to estimate
aspects of noise alone)
2. The difficulty of enhancing weak, unvoiced speech
3. Difficulty of reducing non-stationary interfering noise
4. Frequent need to function in real time
5. Reducing computation
• SE can be used to improve speech signals
1. For human listening
2. As a pre- or post-processing step in speech coding systems
3. As a pre-processing step in speech recognizers
• The methods depend on application


Introduction
• For human listening
• SE should aim for high quality as well as intelligibility
• For coders or recognizers
• Quality is irrelevant (enhanced version of the speech could sound worse)
• This enhanced speech allows efficient parameter estimation in a coder or higher
accuracy in a recognizer
• For example, pre-emphasizing (balancing relative amplitudes across frequency)
the speech in anticipation of broadband channel noise (which may distort many
frequencies) doesn’t enhance the speech but allows easier noise removal later (via
de-emphasis)
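The pre-emphasis/de-emphasis pair mentioned above can be sketched as a pair of first-order filters. This is a minimal illustration; the coefficient 0.95 is a common textbook value, not one given in these notes:

```python
def pre_emphasis(x, mu=0.95):
    """First-order pre-emphasis: y[n] = x[n] - mu * x[n-1].
    Boosts the relative amplitude of high frequencies before transmission."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def de_emphasis(y, mu=0.95):
    """Inverse (IIR) filter: out[n] = y[n] + mu * out[n-1], undoing pre_emphasis."""
    out, prev = [], 0.0
    for v in y:
        prev = v + mu * prev
        out.append(prev)
    return out

# Round trip: de-emphasis exactly undoes pre-emphasis (up to float error)
x = [1.0, 2.0, 3.0, 4.0]
restored = de_emphasis(pre_emphasis(x))
```

Any broadband channel noise added between the two filters is attenuated at high frequencies by the de-emphasis stage, which is the point made in the bullet above.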
• For performance evaluation of SE with synthesizers or low-rate coders,
subjective tests are required (quality and intelligibility cannot be measured
mathematically); measuring the effect of SE on recognition of noisy speech is
much easier (objective tests or listening tests, iterative SE)

Introduction
• Most SE techniques improve speech quality without increasing intelligibility,
some reduce intelligibility
• SNR is an easily computed objective measure, which reflects quality rather
than intelligibility
• Aspects of quality are important in reproducing music
• However, when speech is subject to distortions, it is usually more important
to restore its intelligibility than its quality
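Since the text calls SNR an easily computed objective measure, a minimal sketch may help, assuming the clean signal is available as a reference (the usual setting for objective evaluation) and that the noise is additive:

```python
import math

def snr_db(clean, noisy):
    """SNR in dB: clean-signal power over noise power, where the noise is
    taken as noisy - clean (additive-noise assumption)."""
    sig = sum(s * s for s in clean)
    noise = sum((y - s) ** 2 for s, y in zip(noisy, clean))
    return 10.0 * math.log10(sig / noise)

clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.1, -0.9]   # clean plus a constant 0.1 noise sample
# snr_db(clean, noisy) is about 20 dB (power ratio of roughly 100)
```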


Nature of Interfering Sounds


• Different types of interference may need different suppression techniques
• Noise may be continuous, impulsive, or periodic, and its amplitude may vary
across frequency (occupying broad or narrow spectral ranges)
• e.g., background noise or transmission noise is often continuous and broadband
(sometimes modelled as “white noise” – uncorrelated time samples with a flat
spectrum)
• Other distortions may be abrupt and strong but of very brief duration
• e.g., radio static and fading
• Hum noise from machinery or from AC power lines may be continuous, but
present only at a few frequencies
• Noise which is not additive (e.g., multiplicative or convolutional) may be
handled by applying a logarithmic transformation to the noisy signal, either in
the time domain (for multiplicative noise) or in the frequency domain (for
convolutional noise), which converts the distortion to an additive one
(allowing basic SE methods to be applied)
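The log-transform idea can be checked numerically: for a multiplicative distortion, the log of each noisy sample equals the sum of the logs, i.e., the problem becomes additive. A minimal sketch with made-up sample values:

```python
import math

# Hypothetical per-sample values: speech s[n] and multiplicative noise m[n]
s = [0.5, 0.8, 1.3]
m = [1.2, 0.9, 1.1]

noisy = [si * mi for si, mi in zip(s, m)]               # multiplicative distortion
log_noisy = [math.log(abs(v)) for v in noisy]
log_additive = [math.log(abs(si)) + math.log(abs(mi))   # now an additive problem
                for si, mi in zip(s, m)]
```

The same identity applied to spectral magnitudes (log|X(ω)H(ω)| = log|X(ω)| + log|H(ω)|) is what handles the convolutional case in the frequency domain.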
• Interfering speakers present a different problem for SE
• When people hear several sound sources, they can often direct their attention
to one specific source and perceptually exclude others
• This is facilitated by the stereo reception via listener’s two ears
• In binaural sound reception, the waves arriving at each ear are slightly
different (e.g., in time delays and amplitudes) and one can often localize the
position of the source and attend to that source suppressing perception of
other sounds
• How brain suppresses such interference is poorly understood
• Monaural listening (e.g., via a telephone handset) has no directional cues,
and the listener must rely on the desired sound source being stronger (or
having major energy at different frequencies) than competing sources
• When a desired source can be monitored by several microphones, techniques
can exploit the distance between microphones
• However, most practical SE applications involve monaural listening with input
from one microphone (directional, head-mounted, noise-cancelling microphones
can often minimize the effects of echo and background noise)
• The speech of interfering speakers occupies the same overall frequency range
as that of a desired speaker but such voiced speech usually has 𝐹0 and
harmonics at different frequencies
• Some SE methods attempt to identify the strong frequencies either of desired
speaker or of the unwanted source and to separate their spectral components
to the extent that the components do not overlap

Nature of Interfering Sounds
• Interfering music has properties similar to speech, allowing the possibility
of its suppression via similar methods (except that some musical chords have
more than one 𝐹0, thus spreading energy to more frequencies than speech does)


SE Techniques
There are four classes of SE techniques each with its own advantages and
limitations
1. Subtraction of interfering sounds
Either the important speech-related components of the distorted input signal are
estimated (retaining them and eliminating other components), or the corrupt portions
of the signal are estimated (and subtracted from the input) in the time or frequency domain
2. Filtering out such sounds
Distorted components are suppressed using filtering methods in time or frequency
domain
3. Suppression of non-harmonic frequencies
It works only for voiced speech and requires estimation of 𝐹0 and suppression of
spectral energy between the desired harmonics (the assumption is that enhancement
of noisy speech is feasible only for strong periodic sounds, and that unvoiced
sounds are irretrievably lost or too difficult to enhance)
4. Resynthesis using a vocoder
Adopts speech production model (e.g., low rate coding) and reconstructs a clean speech
signal based on the model using parameters estimated from the noisy speech

Spectral Subtraction (SS)


• An interfering sound may be captured separately from the desired speech (i.e.,
in multi-microphone applications)
• The latter is then enhanced by subtracting out the former
• For best results, a second microphone is required to be closer to the noise source (a
noise reference to subtract it from the primary recording) than the primary
microphone recording the desired speech
• In single-microphone applications, signal analysis during pauses gives an
estimate of the noise, which can then be suppressed using an adaptive filter
model of the noise
• This employs an average spectral model of noise and gives much less enhancement
than a two microphone method as it can only identify the spectral distribution of
noise and not its time variation
• Frequencies where the noise energy is high can be suppressed in the signal but this
distorts the desired speech at these frequencies
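The average spectral model of the noise mentioned above can be obtained by averaging magnitude spectra over frames detected as speech pauses. A sketch, assuming pause detection has already been done elsewhere:

```python
def noise_magnitude_estimate(pause_frames):
    """Average magnitude spectrum over frames classified as speech pauses.
    Each frame is a list of DFT-bin magnitudes; the result is the average
    spectral model of the noise used in single-microphone methods."""
    n = len(pause_frames)
    return [sum(frame[k] for frame in pause_frames) / n
            for k in range(len(pause_frames[0]))]

# Two hypothetical pause frames with two frequency bins each
est = noise_magnitude_estimate([[1.0, 2.0], [3.0, 4.0]])
```

As the text notes, such an average captures only the spectral distribution of the noise, not its time variation.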


Spectral Subtraction (SS)


• This method is applicable for speech affected with stationary noise
• It transforms the noisy speech and an estimate of the interference 𝑖(𝑛) into
Fourier transforms 𝑃(𝜔) and 𝐼(𝜔)
• Their magnitudes are subtracted, yielding |𝑃(𝜔)| − 𝛼|𝐼(𝜔)| - (1)
• This is combined with the original phase of 𝑃(𝜔) and transformed back into a
time signal
• The noise estimation factor 𝛼 ≥ 1 (typically 𝛼 = 1.5) helps minimize some
distortion effects
• Any negative values from the above equation are reset to zero assuming that
such noisy frequencies cannot be recovered
• This method corresponds to Maximum Likelihood (ML) estimation of noisy
speech signal
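A single-frame sketch of eq.(1), assuming a precomputed noise magnitude spectrum. A pure-Python DFT is used only to keep the example self-contained; a real implementation would use an FFT over overlapping windowed frames:

```python
import cmath
import math

def dft(x):
    """Naive DFT, for illustration only (an FFT would be used in practice)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT returning the real part of each time sample."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtract(noisy_frame, noise_mag, alpha=1.5):
    """Eq.(1) for one frame: subtract alpha * |I(w)| from |P(w)|, reset
    negative results to zero, and recombine with the original noisy phase."""
    P = dft(noisy_frame)
    enhanced = []
    for Pk, Ik in zip(P, noise_mag):
        mag = max(abs(Pk) - alpha * Ik, 0.0)        # half-wave rectification
        enhanced.append(cmath.rect(mag, cmath.phase(Pk)))
    return idft(enhanced)

# With a zero noise estimate, the frame passes through unchanged
frame = [1.0, 2.0, 3.0, 4.0]
out = spectral_subtract(frame, [0.0] * 4)
```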

Spectral Subtraction (SS)


• In the case of negative SNR (i.e., more energy in the interference than in
the desired speech), this method works well for both general noise and
interfering speakers, although musical tone or noise artifacts often occur at
frame boundaries in such reconstructed speech
• The tones are due to random appearance of narrowband residual noises at
frequencies where the SS method yields a negative spectral amplitude (and the
algorithm arbitrarily assigns a zero output)
• These extraneous tones can be reduced by altering the SS technique (raising 𝛼
or allowing it to vary in time or frequency), frequency smoothing, time
smoothing (but introduces echoes)
• SS reduces noise power (improving quality), but often reduces intelligibility (in
low SNR situations) due to suppression of weak portions of speech (high
frequency formants and unvoiced speech)


Spectral Subtraction (SS)


• There are many variants of SS such as

• Replacing eq.(1) with |𝑃(𝜔)|ᵃ − 𝛼|𝐼(𝜔)|ᵃ
• Basic SS uses 𝑎 = 1
• Power SS uses 𝑎 = 2
• An advantage of Power SS is that time-domain methods involving
autocorrelation signals become feasible
• Sometimes, the first component in eq.(1) is replaced by an average over a few
frames (this reinforces the consistent speech components, i.e., harmonics, at
the expense of random noise components), but this smooths out speech
transitions, leading to slurring effects
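In the usual formulation, these variants differ only in a spectral exponent: basic (magnitude) SS subtracts first powers, power SS subtracts squared magnitudes. A hedged per-bin sketch:

```python
def generalized_ss(p_mag, i_mag, alpha=1.0, a=2.0):
    """Per-bin generalized subtraction |P|^a - alpha * |I|^a, floored at
    zero, with the result returned in the magnitude domain.
    a = 1 gives basic (magnitude) SS; a = 2 gives power SS."""
    diff = max(p_mag ** a - alpha * i_mag ** a, 0.0)
    return diff ** (1.0 / a)
```

For example, with a noisy-bin magnitude of 3.0 and a noise estimate of 1.0, basic SS (a = 1) gives 2.0, while power SS (a = 2) gives √8.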


Enhancement by Resynthesis
• This improves speech via parametric estimation, where speech synthesizers
extract parameters using a vocal tract model or previously analyzed speech
• Most synthesizers employ separate representations for vocal tract shape and
excitation information, coding the former with about 10 spectral parameters
(formants and bandwidths) and coding the latter with estimates of intensity
and periodicity
• Synthesis suffers from the same mechanical quality as found in the low-rate
speech coding and from degraded parameter estimation (due to noise), but can
be free of direct noise interference, if the parameters model the original speech
accurately

Enhancement by Resynthesis
• Low-rate speech coding methods (e.g., LPC) may be applied in some SE cases
• In a low dimensional speech production model, speech is assumed to come
from passing an excitation signal with a flat spectrum through an all-pole
vocal tract filter model and to have a z-transform response
𝑆(𝑧) = 𝐺 ⁄ (1 − ∑ₖ₌₁ᵖ 𝑎ₖ 𝑧⁻ᵏ) - (2)
where the gain factor 𝐺 and the LPC coefficients 𝑎ₖ are directly related to the
pole locations (a function of formant frequencies and bandwidths), with 𝑝 ≈ 10 − 12
• LPC works best when the coefficients are estimated from noise free speech but
degrades badly on noisy speech
• Non-causal Wiener filtering based on the LPC all-pole model attempts to solve
for the ML estimate of speech in additive noise
• It tends to output speech with overly narrow bandwidths and large frame to
frame fluctuations
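The coefficients of eq.(2) are commonly estimated from the frame autocorrelation via the Levinson-Durbin recursion. This is a sketch of that standard (noise-free) method, not of a noise-robust estimator:

```python
def autocorr(x, p):
    """Short-term autocorrelation r[0..p] of a speech frame x."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the order-p Yule-Walker equations for the LPC coefficients
    a_1..a_p of eq.(2); also returns the residual prediction-error energy,
    which relates to the gain factor G."""
    a = [0.0] * (p + 1)
    e = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / e
        a_new = a[:]
        a_new[i] = k
        for j in range(1, i):
            a_new[j] = a[j] - k * a[i - j]
        a = a_new
        e *= (1.0 - k * k)
    return a[1:], e

# Order-1 example: r = [1.0, 0.5] gives a_1 = 0.5, residual energy 0.75
coeffs, err = levinson_durbin([1.0, 0.5], 1)
```

When the autocorrelation is computed from noisy speech, the additive noise inflates r[0], which is one way of seeing why LPC estimation degrades badly in noise.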


Enhancement by Resynthesis
• Spectral constraints based on redundancies in the speech production process
and on aspects of perception can overcome this flaw and raise speech quality,
as well as accelerate convergence in iterative Wiener filtering
• The LPC synthesis filter is usually excited by either a periodic source or a
noise source depending on whether the analyzed speech is estimated to be
voiced or not
• Speech quality improves if some aspects of phase are preserved in voiced
excitation rather than being discarded (as often done in LPC)
• Phase can be retained by direct modelling of both the amplitude and phase of
the original speech harmonics or by modelling a version of the inverse filtered
speech
• The use of direct harmonic modelling in LPC resynthesis of speech degraded by
an interfering speaker is successful in raising quality but not intelligibility

Enhancement by Resynthesis
• Resynthesis is not the best choice among SE methods, due to the difficulty of
estimating model parameters from distorted speech and due to inherent flaws in
most speech models
• However, it has applications in improving the speech of handicapped speakers
• SE methods can reduce impulse noise (often found in telephony and some
analog storage) as well as stationary noise without degrading speech quality
