DOI 10.1007/s10772-013-9213-5
Abstract In this paper, we propose a speech enhancement method where the front-end decomposition of the input speech is performed by temporal processing using a filterbank. The proposed method incorporates a perceptually motivated stationary wavelet packet filterbank (PM-SWPFB) and an improved spectral over-subtraction (I-SOS) algorithm for the enhancement of speech in various noise environments. The stationary wavelet packet transform (SWPT) is a shift invariant transform. The PM-SWPFB is obtained by selecting the stationary wavelet packet tree in such a manner that it matches closely the non-linear resolution of the critical band structure of the psychoacoustic model. After the decomposition of the input speech, the I-SOS algorithm is applied in each subband, separately, for the estimation of speech. The I-SOS uses a continuous noise estimation approach and estimates the noise power from each subband without the need of explicit speech silence detection. The subband noise power is estimated and updated by adaptively smoothing the noisy signal power. The smoothing parameter in each subband is controlled by a function of the estimated signal-to-noise ratio (SNR). The performance of the proposed speech enhancement method is tested on speech signals degraded by various real-world noises. Using objective speech quality measures (SNR, segmental SNR (SegSNR), perceptual evaluation of speech quality (PESQ) score), and spectrograms with informal listening tests, we show that the proposed speech enhancement method outperforms the spectral subtractive-type algorithms and improves the quality and intelligibility of the enhanced speech.

Keywords Speech enhancement · Stationary wavelet packet filterbank · Critical band rate scale · Spectral over-subtraction · Adaptive noise estimation · Objective measure · Speech spectrograms

N. Upadhyay (B)
Department of Electrical & Electronics Engineering, Birla Institute of Technology & Science Pilani, Pilani 333031, Rajasthan, India
e-mail: navneet_upd@rediffmail.com

A. Karmakar
Integrated Circuit Design Group, Central Electronics Engineering Research Institute/Council of Scientific and Industrial Research, Pilani 333031, India
e-mail: abhijit@ceeri.ernet.in

1 Introduction

Speech is the most direct way for humans to convey information to one another in various application areas such as mobile communication, speech recognition, and speaker identification (O'Shaughnessy 2007). In many situations, the speech signal is severely degraded due to various types of background noises that limit its effectiveness for communication and make the listening task difficult for a direct listener (Ephraim 1992). The background noise may be due to the noisy environment in which the speaker is speaking, or it may arise from noise in the transmission media. Therefore, the reduction of noise components from the noisy speech and, in turn, its enhancement have been a primary focus of research in the field of speech signal processing over the past decades, and it still remains an open problem. The goal of speech enhancement research is to minimize the effect of noise and make speech more pleasant and understandable to the listener, thereby improving one or more perceptual aspects of speech, such as overall speech quality and/or intelligibility (Ephraim et al. 2006; Ephraim and Cohen 2006). Speech quality and intelligibility are not necessarily correlated.
Int J Speech Technol
The classification of speech enhancement methods depends on the number of microphones that are used for collecting speech, such as single, dual or multi-channel. Although the performance of multi-channel speech enhancement is found to be better than single channel speech enhancement (O'Shaughnessy 2007), single channel speech enhancement is still a significant field of research interest because of its simple implementation and easy computation. A single channel speech enhancement method uses only one microphone to collect noisy speech data, with no additional information about the degrading noise and the clean speech being available. This method requires a noise estimation process during speech pauses. Also, the estimation of the spectral magnitude from the noisy speech is easier than the estimation of both magnitude and phase. In Lim and Oppenheim (1979), it is revealed that the short-time spectral magnitude (STSM) is more important than phase information for the intelligibility and quality of speech signals.

Spectral subtraction (Boll 1979) is one of the most widely used single channel speech enhancement methods based on the direct estimation of the STSM. The main attraction of the spectral subtraction method is: (i) its relative simplicity, in that it only requires an estimate of the noise power spectrum, and (ii) its high flexibility against subtraction parameter variation. Usually, the spectral subtraction method uses the statistical information of silence regions, detected by a voice activity detector (VAD). However, if the background noise is non-stationary, it will be difficult to use a VAD. Also, a drawback of spectral subtraction is that the enhanced speech contains perceptually noticeable spectral artifacts composed of unnatural tones at random frequencies, known as remnant musical noise, which is perceptually annoying to the human ear. In recent years, a number of speech enhancement algorithms have been proposed which modify the spectral subtraction method to combat the problem of remnant musical noise artifacts (Berouti et al. 1979; Kamath and Loizou 2002; Virag 1999) and improve the quality of speech in noisy environments. These frequency domain speech enhancement algorithms constitute a family of spectral subtractive-type algorithms and are based on subtracting the estimated STSM of the noise from the STSM of the noisy speech, or on multiplying the noisy spectrum with gain functions, and combining the result with the phase of the noisy speech (Upadhyay and Karmakar 2012). In Boll (1979), a magnitude averaging rule is proposed. In Berouti et al. (1979), the over-subtraction of noise is proposed and a spectral floor is defined to make the remnant musical noise inaudible. In Kamath and Loizou (2002), a speech enhancement algorithm incorporating a multi-band model in the frequency domain is proposed. In Virag (1999), the masking properties of the human auditory system (HAS) are used to reduce the effect of remnant noise. However, the performances of these algorithms are not satisfactory in adverse conditions, particularly when the signal-to-noise ratio (SNR) is low. The reason is that in low SNR conditions, it is difficult to suppress noise without introducing remnant noise and speech distortions. The methods that adopt the masking property of the HAS can reduce the effect of remnant noise, but the drawback is the large computational effort associated with the subband decomposition. Therefore, spectral subtractive-type algorithms are in general effective in reducing the background noise but not in improving intelligibility (Upadhyay and Karmakar 2012). Therefore, it is necessary to find a new way to improve intelligibility and reduce speech distortion when removing noise.

The wavelet transform (WT) is applied to various areas of research including speech and image de-noising, compression, detection, and pattern recognition, and can easily be computed by filtering a signal with multi-resolution filterbanks (Mallat 1989, 2009; Strang and Nguyen 1996). In Donoho (1995), the wavelet transform has been applied for the enhancement of speech on the basis of a simple and powerful de-noising technique known as wavelet thresholding (shrinking). However, it is not possible to separate the speech signal from noises by a simple threshold, because applying a uniform threshold to all wavelet coefficients would remove some speech components as well, while suppressing additional noise. This is especially true for signals degraded by non-stationary noise and for some deteriorated speech conditions. This method often eliminates the unvoiced components of speech, losing information that affects the quality of the reconstructed signal. In Chen and Wang (2004), Lu and Wang (2004), perceptual wavelet packet decomposition (PWPD) has been applied for the enhancement of speech. Although PWPD leads to improved speech quality with reduced computational load, speech distortion caused by down-sampling remains a problem. Moreover, most speech enhancement algorithms perform poorly where the background noise level and characteristics are constantly changing.

In this paper, we propose a speech enhancement method which incorporates a perceptually motivated stationary wavelet packet filterbank (PM-SWPFB) and an improved spectral over-subtraction (I-SOS) algorithm together to enhance the speech in various non-stationary environments. The main reason for selecting the stationary wavelet packet transform (SWPT) as the transforming tool is that it does not require the down-sampler after filtering (Olhede and Walden 2005; Walden and Contreras 1998). Therefore, the absence of the down-sampler leads to a full rate decomposition, and each subband contains the same number of samples as the input. Thus, the SWPT not only improves the frequency resolution, but also maintains a temporal resolution. The use of a perceptual filterbank modeling the non-uniform frequency resolution of the human ear may improve the intelligibility and perceptual quality of degraded speech (Chen and Wang 2004;
Lu and Wang 2004). Therefore, the front-end decomposition of the noisy speech is performed by the PM-SWPFB. The PM-SWPFB is obtained by selecting the stationary wavelet packet tree in such a manner that it closely matches the critical band structure of the psychoacoustic model. Then, the I-SOS algorithm is applied in each band separately for the estimation of speech. The I-SOS algorithm uses a continuous noise estimation approach, called the adaptive noise estimation approach, which does not need explicit speech silence detection. The noise estimation approach estimates and updates the subband noise power by adaptively smoothing the noisy signal power. The smoothing parameter is chosen to be a sigmoid function and is controlled by a function of the estimated SNR in each subband. The proposed speech enhancement method attempts to find the optimal tradeoff between reducing noise, increasing intelligibility, and keeping the distortion acceptable to a human listener.

The rest of this paper is organized as follows. Section 2 presents details of the proposed speech enhancement method. It consists of the construction of the perceptually motivated stationary wavelet packet filterbank, the noise estimation approach, and the spectral over-subtraction algorithm. Section 3 describes the experimental setup. The experimental results and discussion are presented in Sect. 4. Conclusions are drawn in Sect. 5.

2 Proposed speech enhancement method

The block diagram of the proposed speech enhancement method (PM) is shown in Fig. 1. The proposed method is based on temporal and spectral processing and is composed of the following steps (Upadhyay and Karmakar 2012):

(i) Firstly, the perceptually motivated stationary wavelet packet filterbank (PM-SWPFB) is used to decompose the input noisy speech y(n) into non-uniform subbands yk(n), k = 1, 2, . . . , 17.
(ii) Secondly, we use an improved spectral over-subtraction (I-SOS) algorithm to estimate the speech in each subband by using the adaptive noise estimation approach. The adaptive noise estimation approach does not require a speech silence detector. In Fig. 2, the block diagram of the subband speech enhancement algorithm is presented.
(iii) Finally, the enhanced speech ŝ(n) is reconstructed by the stationary wavelet packet synthesis filterbank stage.

2.1 Construction of perceptually motivated stationary wavelet packet filterbank

One of the unique features of human auditory processing is the existence of critical bands (CBs). When the CBs are placed next to each other, the critical band rate (CBR) scale is obtained. This CBR scale is based on the fact that our hearing system analyses a broad spectrum in parts corresponding to CBs. Thus, for wavelet based speech processing, the wavelet packet (WP) tree is often chosen. Here, for the stationary wavelet packet transform based decomposition, the WP tree is selected to approximate the CBs of the psychoacoustic model as closely as possible, so that the signal can be analyzed and processed in accordance with the perceptual frequency scale.

The CBR scale is also known as the Bark scale, where one Bark is referred to as the bandwidth of one critical band. Based on the measurements by Zwicker and Terhardt (1980), an approximate analytical expression to describe the relationship between linear frequency and CB number (in Bark) is

z(f) = 13 arctan(0.76 f / 1000) + 3.5 arctan((f / 7500)²)   (1)
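As a quick numerical check of Eq. (1), the short sketch below (plain Python, standard library only; the function name `bark` is ours) maps a few linear frequencies to Bark numbers:

```python
import math

def bark(f_hz):
    """Critical band number z (in Bark) for a linear frequency f (Hz),
    after Zwicker and Terhardt (1980), Eq. (1)."""
    return (13.0 * math.atan(0.76 * f_hz / 1000.0)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))

# Sweep a few frequencies across the 0-4 kHz narrowband speech range.
for f in (100.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f"{f:6.0f} Hz -> {bark(f):5.2f} Bark")
```

At 4 kHz, the upper edge of the band for 8 kHz narrowband speech, z(f) evaluates to roughly 17 Bark, consistent with the 17 critical bands used by the PM-SWPFB below.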
Here, z is the CB number in Bark, and f is the linear frequency in Hz. The corresponding critical bandwidth (CBW), which refers to the non-uniform frequency response of the human ear, at a center frequency can be expressed as

CBW(fc) = 25 + 75 [1 + 1.4 (fc / 1000)²]^0.69   (2)

where fc is the center frequency in Hz. Theoretically, the range of human auditory frequency spreads from 20 Hz to 20 kHz and covers approximately 24 Barks. However, the sampling rate of narrowband speech is chosen to be 8 kHz. Within this sampling rate, there are approximately 17 CBs, as listed in Table 1.

For the sampling rate fs, the bandwidth of the stationary wavelet packet decomposition (SWPD) at the jth level (Olhede and Walden 2005; Walden and Contreras 1998) is

CBW(j, n) = fs / 2^(j+1)   (3)

Here, CBW(j, n) represents the bandwidth corresponding to node (j, n) in the SWPT tree, J is the maximum number of levels, j = 0, 1, 2, . . . , J; n is the position of the node, or shift, where n = 0, 1, . . . , (2^j − 1), and fs is the sampling rate.
Fig. 3 (a) The tree structure of the proposed PM-SWPFB, and (b) the bandwidths of the PM-SWPFB tree
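As a minimal illustration of the shift-invariant, full-rate property claimed for the SWPT, the toy sketch below implements one level of an undecimated Haar-like analysis/synthesis pair in plain Python. The paper's actual filterbank uses db4 filters and a 5-level packet tree; this is only a stand-in showing that, with no down-sampler, every subband keeps the full sample count and the split stays perfectly invertible:

```python
# One level of an undecimated (stationary) Haar-like analysis/synthesis
# pair, with circular boundary handling.  A toy stand-in for the db4
# based PM-SWPFB of Fig. 3(a).

def swt_haar_level(x):
    n = len(x)
    low  = [(x[i] + x[(i + 1) % n]) / 2.0 for i in range(n)]   # smooth
    high = [(x[i] - x[(i + 1) % n]) / 2.0 for i in range(n)]   # detail
    return low, high

def iswt_haar_level(low, high):
    # Exact inverse of the split above: x[i] = low[i] + high[i].
    return [l + h for l, h in zip(low, high)]

x = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
low, high = swt_haar_level(x)
assert len(low) == len(high) == len(x)        # full-rate subbands
x_rec = iswt_haar_level(low, high)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, x_rec))
```

A packet tree would reapply the same split, with zero-upsampled ("à trous") filters, to both the low and high outputs at each level, which is how the 17 subbands of Fig. 3(a) are produced.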
According to the specifications of the center frequencies (fc), CBW (f), and lower (fl) and upper (fu) cut-off frequencies given in Table 1, the tree structure of the PM-SWPFB can be constructed as shown in Fig. 3(a). It contains 5 decomposition stages to approximate 17 CBs, which correspond to the wavelet packet coefficient sets wj,m(k). The corresponding frequency bandwidth of the PM-SWPFB is shown in Fig. 3(b). The resulting 17 subband PM-SWPFB of the
Bark scale and the CBW are plotted in Figs. 4(a) and 4(b), respectively.

2.2 Noise estimation

The noise estimate has a major impact on the quality of the enhanced signal. If the noise estimate is too low, annoying remnant noise will be audible, and if the noise estimate is too high, speech will be distorted, possibly resulting in intelligibility loss (Loizou 2007). The simplest approach is to estimate and update the noise spectrum during the silence segments of the signal using a voice activity detection algorithm. But the drawback of this approach is that it works satisfactorily only in the case of stationary noise, and does not work well in more realistic environments.

There are many methods to estimate the noise power, especially during speech activity. The noise power can be estimated using minimal-tracking algorithms (Martin 2001) and time recursive averaging algorithms (Cohen 2003; Doblinger 1995). It is obvious that the performance of the estimation approach will influence the behavior of the algorithm. For example, in Martin (2001) a minimal-tracking algorithm is presented, but the drawback of this algorithm is that the noise estimate increases whenever the noisy speech power increases. In the recursive averaging type of algorithms (Cohen 2003; Doblinger 1995), the noise spectrum is estimated as a weighted average of the past noise estimates and the present noisy speech spectrum. The weights change adaptively depending on the effective SNR of each frequency bin or the speech presence probability. In our method, we estimate and update the noise spectrum in each subband separately. The noise estimation for each subband uses the first order relation

|D̂i(ω, k)|² = λi(ω, k)|D̂i(ω, k − 1)|² + [1 − λi(ω, k)]|Yi(ω, k)|²   (4)

where i is the subband number, k is the frame index, ω is the frequency bin index, |D̂i(ω, k)|² denotes the estimated noise power spectral density at frame k and frequency ω, and |Yi(ω, k)|² is the magnitude squared spectrum of the subband noisy speech. λi(ω, k) is a time and frequency dependent smoothing parameter, and its value depends on the noise changing rate.

The smoothing parameter at each subband i is denoted by λi(ω, k) for frame k. In the recursive technique (Loizou 2007; Lin et al. 2002), the smoothing parameter is chosen to be a sigmoid function which changes with the a-posteriori SNRi(ω, k), as

λi(ω, k) = 1 / (1 + e^(−a(SNRi(ω,k) − T)))   (5)

Here, the noise changing rate is affected by the parameter a in the sigmoid function (5). The value of a varies in the range 1 to 6 at a constant value of T. The parameter T in (5) is the center-offset of the transition curve of the sigmoid function, and the value of T is around 4 to 5. The effect of a and T on the sigmoid function can be observed from Fig. 5, the experimental conditions of which are given in Sect. 4.

The update of the subband noise estimate must be performed only in the absence of speech at the corresponding frequency bin. This can be accomplished by controlling the smoothing factor λi(ω, k) depending on the a-posteriori SNRi(ω, k) in the ith subband

SNRi(ω, k) = 10 log10( |Yi(ω, k)|² / [(1/m) Σ_{p=1}^{m} |D̂i(ω, k − p)|²] ) dB   (6)

The denominator of (6) is the average of the noise estimates of the previous m frames (numbering between 5 to 10) immediately before frame k.

Theoretically, the a-posteriori SNR should always be 1 when only noise is present and greater than 1 when both speech and noise are present. The progression of the noise estimation approach as given in (5) and (6) is given below.

(i) If speech is present in frame k, the a-posteriori estimate SNRi(ω, k) will be large and therefore λi(ω, k) ≈ 1. Consequently, we will have |D̂i(ω, k)|² ≡
|D̂i(ω, k − 1)|² according to (4). The noise update will cease, and the noise estimate will remain the same as the previous frame's estimate.
(ii) If speech is absent in frame k, the a-posteriori estimate SNRi(ω, k) will be small and therefore λi(ω, k) ≈ 0. As a result, |D̂i(ω, k)|² ≡ |Yi(ω, k)|², and the noise estimate will follow the power of the noisy spectrum in the absence of speech.

The main advantage of using the time varying smoothing factor λi(ω, k) is that the noise estimation will adapt at different rates for the various frequency bins, depending on the estimate of the a-posteriori SNRi(ω, k) in the corresponding frequency bins. Thus, the approach described above has the potential to effectively suppress the noise in each subband separately, and is more suited for the enhancement of speech degraded by non-stationary noise.

2.3 Spectral over-subtraction algorithm

The spectral over-subtraction algorithm, proposed by Berouti et al. (1979), is used for the enhancement of speech degraded by stationary noise. This algorithm gives superior results compared with the classical spectral subtraction algorithm (Boll 1979). In our method, an improved spectral over-subtraction algorithm is applied separately in each subband obtained after the decomposition using the PM-SWPFB. The noise is estimated from each subband by using the adaptive noise estimation approach, and the over-subtraction factor is adjusted in each subband. The I-SOS algorithm used in the proposed speech enhancement method is explained below.

The noisy signal can be modeled as a sum of the clean speech and the random noise as (Boll 1979)

y(n) = s(n) + d(n),   0 ≤ n ≤ N − 1   (7)

where n is the discrete time index, and y(n), s(n), and d(n) are the nth samples of the discrete time signals of noisy speech, clean speech, and noise, respectively, where s(n) is assumed to be uncorrelated with d(n).

The input signal is decomposed into wavelet coefficients of multiple subbands by the PM-SWPFB. Therefore, the ith noisy subband signal is given by

yi(n) = si(n) + di(n)   (8)

where yi(n) is the output of the ith subband. Then, the I-SOS algorithm is applied in each subband separately. The estimate of the clean speech spectrum in the ith subband is obtained by

|Ŝi(ω)|² = |Yi(ω)|² − αi · |D̂i(ω)|²,   if |Ŝi(ω)|² > 0
|Ŝi(ω)|² = β · |Yi(ω)|²,   otherwise   (9)

where ωi < ω < ωi+1. Here, Yi(ω) is the short-time Fourier transform (STFT) of the degraded speech, Ŝi(ω) is the STFT of the enhanced speech, and αi is the subband specific over-subtraction factor, which is a function of the segmental SNR. The segmental SNR of the ith subband can be calculated as

SNRi = 10 log10( Σ_{ω=ωi}^{ωi+1} |Yi(ω)|² / Σ_{ω=ωi}^{ωi+1} |D̂i(ω)|² ) dB   (10)

Here, ωi and ωi+1 are the start and end frequency bins of the ith subband, and |D̂i(ω)|² is estimated using (4). The subband specific over-subtraction factor can be calculated as

αi = αmax,   if SNRi ≤ SNRmin
αi = αmax + (SNRi − SNRmin) (αmin − αmax) / (SNRmax − SNRmin),   if SNRmin ≤ SNRi ≤ SNRmax
αi = αmin,   if SNRi ≥ SNRmax   (11)

Figure 6 shows the relation between the over-subtraction factor and the segmental SNR. Here, the minimum and maximum values of the over-subtraction factor and segmental SNR are αmin = 1, αmax = 5, SNRmin = −5 dB, SNRmax = 20 dB, respectively, and α0 ≈ 4. In (9), the spectral floor parameter was set to β = 0.03.
Fig. 6 The relation between over-subtraction factor and segmental SNR

3 Enhancement experiments

3.1 Speech corpus and noise types

In our evaluations, we use the NOIZEUS corpus speech database.1 NOIZEUS is composed of 30 phonetically balanced sentences belonging to six speakers (three males and three females). The corpus is sampled at 8 kHz and filtered to simulate the receiving frequency characteristics of telephone handsets. NOIZEUS comes with non-stationary noises at different levels of SNR. In our evaluation, we use non-stationary noises such as car, train and babble noises at different SNR levels (i.e. 0 dB, 5 dB, 10 dB, and 15 dB), and two English sentences, sp10 and sp12, produced by a male speaker and a female speaker. We also generate a corresponding stimuli set degraded by stationary additive white Gaussian noise (AWGN) at varying SNR levels (i.e. 0 dB to 15 dB in steps of 5 dB).

1 A Noisy Speech Corpus for Assessment of Speech Enhancement Algorithms. http://www.utdallas.edu/~loizou/speech/noizeus/.

3.2 Evaluation methods

For evaluation purposes, we employ objective speech quality measures, namely, SNR, segmental SNR (SegSNR), and the perceptual evaluation of speech quality (PESQ) improvement score.

(a) Signal-to-noise ratio: SNR is the ratio of the total signal energy to the total noise energy in the utterance (Loizou 2007). The following equation is used for the evaluation of the SNR of the enhanced speech signals

SNR = 10 log10( Σ_{n=1}^{L} s²(n) / Σ_{n=1}^{L} [s(n) − ŝ(n)]² ) [dB]   (12)

where s(n) denotes the original signal, ŝ(n) denotes the enhanced signal, n is the sample index, and L is the number of samples in both signals. The summation is performed over the signal length.

(b) Segmental SNR: SegSNR averages the frame-level SNRs over the signal

SegSNR = (1/M) Σ_{m=0}^{M−1} 10 log10( Σ_{n=0}^{N−1} s²(n + Nm) / Σ_{n=0}^{N−1} [s(n + Nm) − ŝ(n + Nm)]² ) [dB]   (13)

where M denotes the number of frames in a signal and N denotes the number of samples per frame.

(c) Perceptual evaluation of speech quality: PESQ is an objective quality measure algorithm designed to predict the subjective opinion score of a degraded audio sample, and it is recommended by ITU-T for speech quality assessment (Rix et al. 2001). PESQ (Rix et al. 2001) is a fusion of two other perceptually motivated objective speech quality measures: the perceptual analysis measurement system (PAMS) and the perceptual speech quality measure (PSQM). PESQ produces robust estimates of speech quality in the presence of a wide range of noise types. PESQ maps mean opinion score (MOS) estimates to a range between 0.5 and 4.5, where 1.0 corresponds to bad quality and 4.5 corresponds to distortionless quality. In our evaluation, we compute mean PESQ scores over a subset of the NOIZEUS corpus.

In addition to the objective measures, we employ spectrogram analysis with informal subjective listening tests.

(d) Spectrograms: the objective measures (SNR and SegSNR) cannot easily quantify the amount of remnant noise in the enhanced speech. Analyzing the time-frequency distribution of the enhanced speech, and evaluating the structure of the remnant noise, is particularly important. In the experiments, the speech spectrograms are therefore observed to yield more information about the remnant noise and speech distortion.

3.3 Experimental procedure

For wavelet based speech processing, the choice of the mother wavelet function, or wavelet filter, is important for the frequency selectivity. In addition, the computational complexity of the stationary wavelet packet filterbank (SWPFB) is directly dependent on the length of the wavelet filter. In the proposed method (PM), the orthogonal wavelet filters including Daubechies ('dbN'), Symlets ('symN'), and Coiflets ('coifN') are considered in the PM-SWPFB, and their relative performance is compared. In order to select an appropriate wavelet filter for the proposed speech enhancement method from these wavelet filters, an experiment is performed, the result of which is listed in Table 2.

Table 2
Filter length   8      16     20     24     28     24     28     32     18     30
ER              5.22   5.23   5.29   5.28   5.27   5.29   5.25   5.29   5.26   5.27
CPU time (s)    4.48   4.66   4.79   4.72   4.95   4.77   5.03   5.24   5.17   5.18
ER/CPU time     1.16   1.12   1.10   1.12   1.06   1.11   1.04   1.00   1.02   1.01

The enhancement rate (ER), as tabulated in Table 2, is defined as

ER = (1/M) Σ_{i=1}^{M} ( SNR[ŝi(n)] − SNR[xi(n)] )   (14)

where M denotes the number of test sentences used in the experiment. Here, SNR[ŝi(n)] and SNR[xi(n)] denote the SNR of the ith enhanced speech and the original speech, respectively. In our experiment, 4 (M = 4) speech sentences degraded by different real-world noises and white Gaussian noise at different SNR levels are used.

Considering the enhancement rate as well as the computational complexity given in Table 2, the 4 point Daubechies (Db) wavelet filter with length 8 has the best ER/CPU time ratio, and has been taken in our method. Therefore, for the evaluation of the proposed approach described in Sect. 2, the Daubechies 4 (Db4) mother wavelet is used as a perfect reconstruction wavelet filter for the SWPD and the perceptually motivated filterbank design, shown in Fig. 3(a).

In our experiments, we zero-mean and normalize the samples of each of the sentence files to lie between −1.0 and +1.0. The STFT parameters used for the subband speech enhancement are: the sampling rate is 8000 samples/s, the frame duration is set to 32 ms (256 samples), and the frame shift to 50 % (16 ms or 128 samples). The Hamming window is used as the analysis window. We employ an FFT length of 256 samples. The noise estimate is updated adaptively and continuously using the smoothing parameter (5). For the calculation of the smoothing parameter, the values of a and T are chosen to be 4 and 5 (as in Fig. 5), respectively, in the sigmoid function (5).

4 Experimental results and discussion

This section presents a comparative study of the speech enhancement results of the proposed method (PM) against the enhancement results of three other popular frequency domain algorithms: basic spectral subtraction (SS) (Boll 1979), SOS (Berouti et al. 1979), and multi-band spectral subtraction (MBSS) (Kamath and Loizou 2002).

The objective speech quality measures, namely SNR, SegSNR, and mean PESQ improvement scores, for the four different noise types (three non-stationary and one stationary) at the different SNR levels investigated in our experiments are shown in Table 3. These measures show that the proposed speech enhancement method performs better in the case of non-stationary noises (at SNR ≥ 5 dB) and stationary noise (at SNR ≥ 10 dB) compared with the other popular algorithms.

For the sp10.wav sentence produced by a male speaker, the temporal waveforms and speech spectrogram analyses of the clean, degraded and enhanced speech signals are shown in Figs. 7, 8, 9 and 10. The enhanced speech for the car, train, and babble noises does not exhibit speech distortion, and the background noise is suppressed. In the white noise case, though the background noise is suppressed, a small amount of signal distortion is also introduced. We have also conducted informal listening tests where the listeners are provided with the clean signal, the degraded signal, and the enhanced signal. For the car, train and babble noise cases in particular, we found the remnant musical noise present in the enhanced speech to be non-distracting and easy to ignore.

Also, we have presented the temporal waveforms and spectrogram analysis for the sp12.wav sentence, produced by a female speaker and degraded by car noise at 10 dB SNR, against the three popular speech enhancement techniques in Fig. 11. The mean PESQ results are also shown in Fig. 11. The proposed method achieves results comparable to these three methods.

5 Conclusion

In this paper, a speech enhancement method is presented which integrates a perceptually motivated stationary wavelet packet filterbank (PM-SWPFB) and the improved spectral over-subtraction (I-SOS) algorithm together for the enhancement of degraded speech. The PM-SWPFB first decomposes the noisy speech into subbands as per the critical band rate scale of the human auditory system, and then the improved spectral over-subtraction algorithm is applied in each subband. The advantage of the stationary wavelet packet transform (SWPT) is that it offers adequate frequency resolution for designing the PM-SWPFB. The I-SOS algorithm uses an adaptive noise estimation approach to estimate the noise from each subband. The adaptive noise estimation approach does not require a voice activity detector and works continuously even in
Fig. 7 Temporal waveform and speech spectrograms of sp10.wav utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS speech corpus: (From top to bottom) (a) clean speech; (b) speech degraded by car noise (10 dB SNR); (c) speech enhanced by using SS; (d) speech enhanced by using SOS; (e) speech enhanced by using MBSS; (f) speech enhanced by using PM
Fig. 8 Temporal waveform and speech spectrograms of sp10.wav utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS speech corpus: (From top to bottom) (a) clean speech; (b) speech degraded by train noise (10 dB SNR); (c) speech enhanced by using SS; (d) speech enhanced by using SOS; (e) speech enhanced by using MBSS; (f) speech enhanced by using PM
Fig. 9 Temporal waveform and speech spectrograms of sp10.wav utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS speech corpus: (From top to bottom) (a) clean speech; (b) speech degraded by babble noise (10 dB SNR); (c) speech enhanced by using SS; (d) speech enhanced by using SOS; (e) speech enhanced by using MBSS; (f) speech enhanced by using PM
Fig. 10 Temporal waveform and speech spectrograms of sp10.wav utterance, "The sky that morning was clear and bright blue", by a male speaker from the NOIZEUS speech corpus: (From top to bottom) (a) clean speech; (b) speech degraded by white noise (10 dB SNR); (c) speech enhanced by using SS; (d) speech enhanced by using SOS; (e) speech enhanced by using MBSS; (f) speech enhanced by using PM
Fig. 11 Temporal waveform and speech spectrograms of sp12.wav utterance, “The drip of the rain made a pleasant sound”, by a female speaker from the NOIZEUS speech corpus (from top to bottom): (a) clean speech (PESQ = 4.5); (b) speech degraded by car noise (10 dB SNR) (PESQ = 2.043); (c) speech enhanced by using SS (PESQ = 1.782); (d) speech enhanced by using SOS (PESQ = 2.242); (e) speech enhanced by using MBSS (PESQ = 2.005); (f) speech enhanced by using PM (PESQ = 2.341)
Table 3 SNR, SegSNR, and PESQ score results of noisy and enhanced speech signals at (0 dB, 5 dB, 10 dB, 15 dB) input SNRs. The English sentence sp10.wav, produced by a male speaker, is used as the original speech signal

Noise   Method  SNR (dB)                     SegSNR (dB)                  PESQ
                0 dB   5 dB   10 dB  15 dB   0 dB   5 dB   10 dB  15 dB   0 dB   5 dB   10 dB  15 dB
Car     SS      0.95   1.40   1.59   1.71    0.86   1.17   1.28   1.33    1.56   1.90   2.15   2.21
        SOS     2.74   5.75   9.86   14.53   2.71   9.02   12.68  15.78   1.62   2.06   2.43   2.67
        MBSS    2.75   7.05   9.61   12.85   2.68   6.86   9.39   12.59   1.49   1.98   2.25   2.60
        PM      4.39   9.42   12.89  16.01   4.30   9.19   14.34  19.28   1.64   2.16   2.57   2.83
Train   SS      1.24   1.43   1.59   1.67    1.12   1.20   1.29   1.32    1.87   1.66   2.07   2.15
        SOS     2.72   5.70   9.81   14.53   4.22   7.81   11.73  15.53   1.72   1.84   2.29   2.62
        MBSS    4.07   5.73   9.55   11.78   3.88   5.53   9.35   11.59   1.51   1.69   2.12   2.38
        PM      5.54   8.04   11.91  15.74   5.34   8.76   13.89  19.26   1.89   1.88   2.39   2.73
Babble  SS      1.47   1.52   1.63   1.74    1.19   1.21   1.29   1.35    1.48   1.92   2.11   2.21
        SOS     2.74   5.75   9.90   14.53   2.57   7.05   11.33  14.51   1.71   2.12   2.38   2.66
        MBSS    2.66   5.95   9.54   11.81   2.52   5.74   9.31   11.52   1.81   2.20   2.39   2.65
        PM      2.66   7.34   11.52  14.74   4.06   8.32   14.41  19.26   1.90   2.21   2.56   2.69
White   SS      1.42   1.59   1.70   1.78    1.18   1.31   1.34   1.38    1.66   1.95   2.08   2.15
        SOS     6.98   10.3   13.65  16.67   6.75   10.1   13.43  16.45   1.91   2.23   2.42   2.68
        MBSS    6.10   8.80   11.93  13.46   5.90   8.63   11.77  13.26   1.65   1.97   2.30   2.56
        PM      4.84   9.97   15.18  20.20   4.70   9.74   14.97  20.03   1.80   2.14   2.48   2.80
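For reference, the SegSNR columns in Table 3 correspond to the standard frame-averaged segmental SNR. A minimal sketch follows; the 256-sample frame and the [-10, 35] dB clamp are the conventional choices, assumed here rather than taken from the paper:

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=256, floor_db=-10.0, ceil_db=35.0):
    """Segmental SNR in dB: per-frame SNR, clamped, then averaged."""
    n_frames = len(clean) // frame_len
    vals = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = enhanced[i * frame_len:(i + 1) * frame_len]
        num = np.sum(s ** 2)
        den = np.sum((s - e) ** 2) + 1e-12   # guard against a perfect match
        snr_db = 10.0 * np.log10(num / den)
        vals.append(np.clip(snr_db, floor_db, ceil_db))
    return float(np.mean(vals))
```

Unlike global SNR, the per-frame clamp keeps silent or near-perfect frames from dominating the average, which is why SegSNR tracks perceived quality more closely than overall SNR.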
the presence of speech and also provides accurate results even at very low SNR.

The performance assessment of the proposed method is analyzed using objective speech quality measures (SNR, SegSNR, PESQ score) and spectrograms with informal listening tests. The results showed that the proposed method not only reduces background noise but also protects the speech from remnant noise and quality deterioration, especially in the low-SNR cases, and gives consistently good results; it also performs better than the other spectral subtractive-type speech enhancement algorithms.

References

Berouti, M., Schwartz, R., & Makhoul, J. (1979). Enhancement of speech corrupted by acoustic noise. In Proceedings of int. conf. on acoustics, speech, and signal processing, Washington, DC, April 1979 (pp. 208–211).
Boll, S. F. (1979). Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2), 113–120.
Chen, S. H., & Wang, J. F. (2004). Speech enhancement using perceptual wavelet packet decomposition and Teager energy operator. Journal of VLSI Signal Processing, 36(2–3), 125–139.
Cohen, I. (2003). Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 11(5), 466–475.
Doblinger, G. (1995). Computationally efficient speech enhancement by spectral minima tracking in subbands. In Proceedings of Eurospeech (Vol. 2, pp. 1513–1516).
Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613–627.
Ephraim, Y. (1992). Statistical-model-based speech enhancement systems. Proceedings of the IEEE, 80(10), 1526–1555.
Ephraim, Y., & Cohen, I. (2006). Recent advancements in speech enhancement. In The electrical engineering handbook (pp. 12–26). Boca Raton: CRC Press. Chap. 5.
Ephraim, Y., Ari, H. L., & Roberts, W. (2006). A brief survey of speech enhancement. In The electrical engineering handbook (3rd ed.). Boca Raton: CRC Press.
Kamath, S., & Loizou, P. (2002). A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In Proceedings of int. conf. on acoustics, speech, and signal processing, Orlando, USA, May 2002.
Lim, J. S., & Oppenheim, A. V. (1979). Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67, 1586–1604.
Lin, L., Ambikairajah, E., & Holmes, W. H. (2002). Speech enhancement for non-stationary noise environment. In Proceedings of IEEE Asia Pacific conf. on circuits and systems, Oct. 2002 (Vol. 1, pp. 177–180).
Loizou, P. C. (2007). Speech enhancement: theory and practice (1st ed.). London: Taylor & Francis.
Lu, C. T., & Wang, H. C. (2004). Speech enhancement using perceptually constrained gain factors in critical band wavelet packet transform. Electronics Letters, 40(6), 394–396.
Mallat, S. (1989). A theory for multi-resolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
Mallat, S. (2009). A wavelet tour of signal processing: the sparse way (3rd ed.). New York: Academic Press.
Martin, R. (2001). Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9, 504–512.
Olhede, S., & Walden, A. T. (2005). A generalized demodulation approach to time frequency projections for multi-component signals. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 461, 2159–2179.
O’Shaughnessy, D. (2007). Speech communications: human and machine (2nd ed.). Hyderabad: University Press (India) Pvt. Ltd.
Rix, A., Beerends, J., Hollier, M., & Hekstra, A. (2001). Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs. In Proceedings of IEEE int. conf. on acoustics, speech, and signal processing, Salt Lake City, UT (Vol. 2, pp. 749–752).
Strang, G., & Nguyen, T. (1996). Wavelets and filter banks. Wellesley: Wellesley-Cambridge Press.
Upadhyay, N., & Karmakar, A. (2012). The spectral subtractive-type algorithms for enhancing speech in noisy environments. In Proceedings of IEEE int. conf. on recent advances in information technology, I.S.M. Dhanbad, India, March 15–17 (pp. 841–847).
Upadhyay, N., & Karmakar, A. (2012). A perceptually motivated stationary wavelet filterbank utilizing improved spectral over-subtraction algorithm for enhancing speech in non-stationary environments. In Proceedings of IEEE int. conf. on intelligent human computer interaction, IIT Kharagpur, India, Dec 27–29 (pp. 472–478).
Virag, N. (1999). Single channel speech enhancement based on masking properties of the human auditory system. IEEE Transactions on Speech and Audio Processing, 7, 126–137.
Walden, A. T., & Contreras, C. (1998). The phase-corrected undecimated discrete wavelet packet transform and its application to interpreting the timing of events. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 454, 2243–2266.
Zwicker, E., & Terhardt, E. (1980). Analytical expressions for critical band rate and critical bandwidth as a function of frequency. The Journal of the Acoustical Society of America, 68, 1523–1525.