Cocktail Party Processing Via Structured Prediction
Yuxuan Wang, Deliang Wang
Abstract
1 Introduction
The cocktail party problem, or the speech separation problem, is one of the central problems in
speech processing. A particularly difficult scenario is monaural speech separation, in which mix-
tures are recorded by a single microphone and the task is to separate the target speech from its
interference. This is a severely underdetermined figure-ground separation problem, and has been
studied for decades with limited success.
Researchers have attempted to solve the monaural speech separation problem from various angles.
In signal processing, speech enhancement (e.g., [1, 2]) has been extensively studied, and assump-
tions regarding the statistical properties of noise are crucial to its success. Model-based methods
(e.g., [3]) work well in constrained environments, and source models need to be trained in advance.
Computational auditory scene analysis (CASA) [4] is inspired by how the human auditory system func-
tions [5]. CASA has the potential to deal with general acoustic environments but existing systems
have limited performance, particularly in dealing with unvoiced speech.
Recent studies suggest a new formulation of the cocktail party problem, where the focus is to clas-
sify whether a time-frequency (T-F) unit is dominated by the target speech [6]. Motivated by this
viewpoint, we propose to approach the monaural speech separation problem via structured predic-
tion. The use of structured predictors, as opposed to binary classifiers, is motivated by temporal
dynamics in the speech signal. Our study makes the following contributions: (1) we demonstrate that
modeling temporal dynamics via structured prediction can significantly improve separation; (2) to
capture nonlinearity, we propose a new structured prediction model that makes use of the discrimi-
native feature learning power of deep neural networks; and (3) instead of classification accuracy, we
show how to directly optimize a measure that is well correlated with human speech intelligibility.
2 Separation as binary classification
We aim to estimate a time-frequency matrix called the ideal binary mask (IBM). The IBM is a binary
matrix constructed from premixed target and interference, where 1 indicates that the target energy
exceeds the interference energy by a local signal-to-noise (SNR) criterion (LC) in the corresponding
T-F unit, and 0 otherwise. The IBM is defined as:
\[
\mathrm{IBM}(t, f) =
\begin{cases}
1, & \text{if } \mathrm{SNR}(t, f) > \mathrm{LC} \\
0, & \text{otherwise},
\end{cases}
\]
where SNR(t, f) denotes the local SNR (in decibels) within the T-F unit at time t and frequency
f. We adopt the common choice of LC = 0 in this paper [7]. Despite its simplicity, adopting the
IBM as a computational objective offers several advantages. First, the IBM is directly based on
the auditory masking phenomenon whereby a stronger sound tends to mask a weaker one within a
critical band. Second, unlike other objectives such as maximizing SNR, it is well established that
large human speech intelligibility improvements result from IBM processing, even for very low SNR
mixtures [7–9]. Improving human speech intelligibility is considered the gold standard for speech
separation. Third, IBM estimation naturally leads to classification, which opens the cocktail party
problem to a plethora of machine learning techniques.
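To make the definition concrete, here is a minimal sketch of IBM construction from premixed target and interference energies in each T-F unit; the array shapes, helper name, and toy data below are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0, eps=1e-12):
    """Compute the IBM from premixed target and interference T-F unit energies.

    target_energy, interference_energy: non-negative energies per T-F unit,
    shaped (channels, frames). lc_db is the local SNR criterion LC in dB
    (LC = 0 dB is the common choice adopted in the paper).
    """
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (interference_energy + eps))
    return (local_snr_db > lc_db).astype(np.int8)

# Toy usage with random energies standing in for a cochleagram decomposition.
rng = np.random.default_rng(0)
target = rng.random((64, 200))
interference = rng.random((64, 200))
ibm = ideal_binary_mask(target, interference)
print(ibm.shape, ibm.mean())  # shape and fraction of target-dominant units
```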
We propose to formulate IBM estimation as binary classification as follows, which is a form of
supervised learning. A sound mixture sampled at 16 kHz is passed through a 64-channel
gammatone filterbank spanning from 50 Hz to 8000 Hz on the equivalent rectangular bandwidth
rate scale. The output from each filter channel is divided into 20-ms frames with 10-ms frame shift,
producing a cochleagram [4]. Because the spectral properties of speech differ across frequency channels, a subband classifier
is trained for each filter channel independently, with the IBM providing training labels. Acoustic
features for each subband classifier are extracted from T-F units in the cochleagram. The target
speech is separated by binary weighting of the cochleagram using the estimated IBM [4].
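A simplified sketch of this final masking step is shown below, assuming the 64 subband signals from the gammatone filterbank are available as an array; real cochleagram resynthesis also compensates for filter phase shifts and uses a raised-cosine window, which this hypothetical version omits.

```python
import numpy as np

def apply_mask(subband_signals, mask, frame_len=320, frame_shift=160):
    """Weight each 20-ms frame (10-ms shift at 16 kHz) of each subband by its
    mask value, overlap-add the frames, and sum across channels.

    subband_signals: (channels, samples) filterbank outputs of the mixture.
    mask: (channels, frames) binary mask; 1 keeps a T-F unit, 0 removes it.
    """
    channels, samples = subband_signals.shape
    separated = np.zeros(samples)
    window = np.hanning(frame_len)
    for c in range(channels):
        for t in range(mask.shape[1]):
            start = t * frame_shift
            stop = min(start + frame_len, samples)
            separated[start:stop] += (mask[c, t] * window[:stop - start]
                                      * subband_signals[c, start:stop])
    return separated
```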
Several recent studies have attempted to directly estimate the IBM via classification. By employing
Gaussian mixture models (GMMs) as classifiers and amplitude modulation spectrograms (AMS)
as features, Kim et al. [10] show that estimated masks can improve human speech intelligibility in
noise. Han and Wang [11] have improved Kim et al.’s system by employing support vector machines
(SVMs) as classifiers. Wang et al. [12] propose a set of complementary acoustic features that shows
further improvements over previous systems. The complementary feature set is a concatenation of
AMS, relative spectral transform and perceptual linear prediction (RASTA-PLP), mel-frequency
cepstral coefficients (MFCC), and pitch-based features.
Because the ratio of 1’s to 0’s in the IBM is often skewed, simply using classification accuracy as
the evaluation criterion may not be appropriate. Speech intelligibility studies [9, 10] have evaluated
the influence of the hit (HIT) and false-alarm (FA) rate on intelligibility scores. The difference, the
HIT−FA rate, is found to be well correlated with human speech intelligibility in noise [10]. The HIT
rate is the percent of correctly classified target-dominant T-F units (1’s) in the IBM, and the FA rate
is the percent of wrongly classified interference-dominant T-F units (0’s). Therefore, it is desirable
to design a separation algorithm that maximizes HIT−FA of the output mask.
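The following minimal sketch evaluates HIT, FA, and HIT−FA for an estimated mask against the IBM; it is our reading of the definitions above, applied either per channel (for channelwise scores) or over all units pooled.

```python
import numpy as np

def hit_fa(estimated_mask, ideal_mask):
    """HIT - FA of an estimated binary mask relative to the IBM.

    HIT: fraction of target-dominant units (IBM == 1) labeled 1 by the estimate.
    FA:  fraction of interference-dominant units (IBM == 0) labeled 1 by the estimate.
    """
    est = estimated_mask.astype(bool)
    ibm = ideal_mask.astype(bool)
    hit = np.logical_and(est, ibm).sum() / max(ibm.sum(), 1)
    fa = np.logical_and(est, ~ibm).sum() / max((~ibm).sum(), 1)
    return hit - fa, hit, fa
```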
3 Proposed system
Dictated by speech production mechanisms, the IBM contains highly structured, rather than random,
patterns. Previous systems do not explicitly model such structure. As a result, temporal dynamics,
which is a fundamental characteristic of speech, is largely ignored in previous work. Separation
systems accounting for temporal dynamics exist. For example, Mysore et al. [13] incorporate tem-
poral dynamics using HMMs. Hershey et al. [14] consider different levels of dynamic constraints.
However, these works do not treat separation as classification. In contrast to standard binary clas-
sifiers, structured prediction models are able to model correlations in the output. In this paper, we
treat unit classification at each filter channel as a sequence labeling problem and employ linear-chain
conditional random fields (CRFs) [15] as subband classifiers.
3.1 Conditional random fields
Unlike an HMM, a CRF is a discriminative model and does not require independence assumptions
on the features, making it more suitable for our task. A CRF models the posterior probability P(y|x) as
follows. Denoting y as a label sequence and x as an input sequence,
\[
P(y \mid x) = \frac{\exp\left( \sum_t w^T f(y, x, t) \right)}{Z(x)}. \qquad (1)
\]
Here t indexes time frames, w is the parameter vector to learn, and Z(x) = \sum_{y'} \exp\left( \sum_t w^T f(y', x, t) \right)
is the partition function. f is a vector-valued feature function associated with each local site (T-F
unit in our task), and often categorized into state feature functions s(yt , x, t) and transition feature
functions t(yt−1 , yt , x, t). State feature functions define the local discriminant functions for each
T-F unit and transition feature functions capture the interaction between neighboring labels. We
assume a linear-chain setting and the first-order Markovian property, i.e., only interactions between
two neighboring units in time are modeled. In our task, we can simply use acoustic feature vectors
in each T-F unit as state feature functions and their concatenations as transition feature functions:
\[
s(y_t, x, t) = \left[ \delta(y_t = 0)\, x_t, \; \delta(y_t = 1)\, x_t \right]^T, \qquad (2)
\]
\[
t(y_{t-1}, y_t, x, t) = \left[ \delta(y_{t-1} = y_t)\, z_t, \; \delta(y_{t-1} \neq y_t)\, z_t \right]^T, \qquad (3)
\]
where \delta is the indicator function and z_t = [x_{t-1}, x_t]^T. Equation (3) essentially encodes temporal
continuity in the IBM. To simplify notation, all feature functions are written as f(y_{t-1}, y_t, x, t) in
the remainder of the paper.
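As an illustration of equations (2) and (3), the state and transition feature vectors for one T-F unit can be built as in the sketch below; the label set {0, 1} comes from the IBM, while the variable names and dense-vector layout are our own assumptions.

```python
import numpy as np

def state_features(y_t, x_t):
    """s(y_t, x, t): place the acoustic feature vector in the slot of label y_t (eq. 2)."""
    d = x_t.shape[0]
    s = np.zeros(2 * d)
    s[y_t * d:(y_t + 1) * d] = x_t
    return s

def transition_features(y_prev, y_t, x_prev, x_t):
    """t(y_{t-1}, y_t, x, t): concatenated features z_t placed in the
    'same label' or 'label change' slot (eq. 3)."""
    z_t = np.concatenate([x_prev, x_t])
    d = z_t.shape[0]
    t_vec = np.zeros(2 * d)
    slot = 0 if y_prev == y_t else 1
    t_vec[slot * d:(slot + 1) * d] = z_t
    return t_vec
```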
Training amounts to estimating w, and is usually done by maximizing the conditional log-likelihood on a
training set T = {x^{(m)}, y^{(m)}}, i.e., we seek w by
\[
\max_{w} \; \sum_m \log p\bigl(y^{(m)} \mid x^{(m)}, w\bigr) + R(w), \qquad (4)
\]
where m is the index of a training sample, and R(w) is a regularizer of w (we use \ell_2 in this paper).
For gradient ascent, a popular choice is the limited-memory BFGS (L-BFGS) algorithm [16].
A CRF is a log-linear model, which has only linear modeling power. As acoustic features are
generally not linearly separable, the direct use of CRFs is unlikely to produce good results. In the
following, we propose a method to transform the standard CRF into a nonlinear sequence classifier.
3.2 Nonlinear expansion with deep neural networks
We employ pretrained deep neural networks (DNNs) to capture the nonlinearity between input and
output. DNNs have received widespread attention since Hinton et al.'s paper [17]. DNNs can be
viewed as hierarchical feature detectors that learn increasingly complex feature mappings as the
number of hidden layers increases. To deal with problems such as vanishing gradients, Hinton et
al. suggest first pretraining a DNN using a stack of restricted Boltzmann machines (RBMs) in an
unsupervised, layerwise fashion. The resulting network weights are then fine-tuned by supervised
backpropagation.
We first train the DNN in the standard way to classify speech dominance in each T-F unit. After pre-
training and supervised finetuning, we then take the last hidden layer representations from the DNN
as learned features to train the CRF. In a discriminatively trained DNN, the weights from the last
hidden layer to the output layer would define a linear classifier, hence the last hidden layer represen-
tations are more amenable to linear classification. In other words, we replace x by h in equations
(1)-(4), where h represents the learned hidden features. This way CRFs would greatly benefit from
the nonlinear modeling power of deep architectures.
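A possible realization of this feature-learning step is sketched below in PyTorch as a stand-in for the paper's setup: RBM pretraining is omitted, and the layer widths, optimizer, input dimensionality, and random data are placeholders. A two-hidden-layer DNN is fine-tuned on per-unit labels, and its last hidden activations h are exported as CRF inputs.

```python
import torch
import torch.nn as nn

class UnitDNN(nn.Module):
    """Two-hidden-layer DNN for per-unit speech-dominance classification."""
    def __init__(self, in_dim, hidden=200):
        super().__init__()
        self.hidden_layers = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
        )
        self.out = nn.Linear(hidden, 1)  # discriminative output layer

    def forward(self, x):
        h = self.hidden_layers(x)        # last hidden layer: learned features for the CRF
        return torch.sigmoid(self.out(h)).squeeze(-1), h

# Toy fine-tuning loop on random data standing in for acoustic features and IBM labels.
in_dim = 100                             # placeholder feature dimensionality
model = UnitDNN(in_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(1024, in_dim)
y = torch.randint(0, 2, (1024,)).float()
for _ in range(10):
    p, _ = model(x)
    loss = nn.functional.binary_cross_entropy(p, y)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    posteriors, learned_features = model(x)  # h replaces x in equations (1)-(4)
```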
To better encode local contextual information, we could use a window (across both time and fre-
quency) of learned features to label the current T-F unit. A more parsimonious way is to use a
window of posteriors estimated by DNNs as features to train the CRF, which can dramatically re-
duce the dimensionality. We note in passing that the correlations across both time and frequency
can also be encoded at the model level, e.g., by using grid-structured CRFs. However, the decoding
algorithm may substantially increase the computational complexity of the system.
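A sketch of the posterior-window features, assuming the per-unit DNN posteriors are stored as a (channels x frames) array; the window sizes below match the 5-frame by 17-channel setting reported in Section 4.1, and zero padding at the borders is our assumption.

```python
import numpy as np

def posterior_window(posteriors, c, t, time_span=5, freq_span=17):
    """Collect a window of DNN posteriors around unit (c, t) as a feature vector.

    posteriors: (channels, frames) per-unit speech-dominance posteriors.
    Positions outside the cochleagram are zero-padded.
    """
    channels, frames = posteriors.shape
    half_t, half_f = time_span // 2, freq_span // 2
    window = np.zeros((freq_span, time_span))
    for i, cc in enumerate(range(c - half_f, c + half_f + 1)):
        for j, tt in enumerate(range(t - half_t, t + half_t + 1)):
            if 0 <= cc < channels and 0 <= tt < frames:
                window[i, j] = posteriors[cc, tt]
    return window.ravel()  # 85-dimensional feature for the CRF at unit (c, t)
```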
We want to point out that an important advantage of using neural networks for feature learning is
their efficiency in the test phase; once trained, the nonlinear feature extraction of a DNN is extremely
fast (it only involves a forward pass). This is, however, not always true for other methods. For exam-
ple, sparse coding may need to solve a new optimization problem to get the features. Test phase
efficiency is crucial for real-time implementation of a speech separation system.
There is related work on developing nonlinear sequence classifiers in the machine learning commu-
nity. For example, van der Maaten et al. [18] and Morency et al. [19] consider incorporating hidden
variables into the training and inference in CRF. Peng et al. [20] investigate a combination of neural
networks and CRFs. Other related studies include [21] and [22]. The proposed model differs from
the previous methods in that (1) a discriminatively trained deep architecture is used, and/or (2) a CRF
instead of a Viterbi decoder is used on top of a neural network for sequence labeling, and/or (3)
nonlinear features are also used in modeling transitions. In addition, the use of a contextual window
and the change of the objective function discussed in the next subsection are specifically tailored to
the speech separation problem.
3.3 Maximizing HIT−FA
As argued before, it is desirable to train a classifier to maximize the HIT−FA rate of the estimated
mask. In this subsection, we show how to change the objective function and efficiently calculate the
gradients in the CRF. Since subband classifiers are used, we aim to maximize the channelwise HIT−FA.
Denote the output label as u_t ∈ {0, 1} and the true label as y_t ∈ {0, 1}. The per utterance HIT−FA
rate can be expressed as \sum_t u_t y_t / \sum_t y_t - \sum_t u_t (1 - y_t) / \sum_t (1 - y_t), where the first term is the
HIT rate and the second the FA rate. To make the objective function differentiable, we replace u_t by
the marginal probability p(y_t = 1|x), hence we seek w by maximizing the HIT−FA on a training
set:
\[
\max_{w} \;\; \frac{\sum_m \sum_t p\bigl(y_t^{(m)} = 1 \mid x^{(m)}, w\bigr)\, y_t^{(m)}}{\sum_m \sum_t y_t^{(m)}}
\; - \;
\frac{\sum_m \sum_t p\bigl(y_t^{(m)} = 1 \mid x^{(m)}, w\bigr)\, \bigl(1 - y_t^{(m)}\bigr)}{\sum_m \sum_t \bigl(1 - y_t^{(m)}\bigr)}. \qquad (5)
\]
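Given the unit marginals, the objective in (5) is a ratio of sums that can be evaluated as in the sketch below; the gradient discussed next additionally requires differentiating the marginals with respect to w.

```python
import numpy as np

def soft_hit_fa(marginals, labels):
    """Differentiable HIT - FA surrogate of eq. (5): marginals replace hard decisions.

    marginals: p(y_t = 1 | x, w) for every T-F unit of every training utterance, flattened.
    labels:    the corresponding IBM labels in {0, 1}.
    """
    labels = labels.astype(float)
    hit = np.sum(marginals * labels) / np.sum(labels)
    fa = np.sum(marginals * (1.0 - labels)) / np.sum(1.0 - labels)
    return hit - fa
```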
Clearly, computing the gradient of (5) boils down to computing the gradient of the marginal. A
speech utterance (sentence) typically spans several hundred time frames; therefore, numerical
stability is critically important in our task. As can be seen later, computing the gradient of the
marginal requires the gradient of forward/backward scores. We adopt Rabiner’s scaling trick [23]
used in HMM to normalize the forward/backward score at each time point. Specifically, define
α(t, u) and β(t, u) as the forward and backward score of label u at time t, respectively. We normalize
the forward score such that \sum_u \alpha(t, u) = 1, and use the resulting scaling to normalize the backward
score. Defining the potential function \phi_t(v, u) = \exp\bigl( w^T f(v, u, x, t) \bigr), the recurrence of the
normalized forward/backward score is written as
\[
\alpha(t, u) = \sum_v \alpha(t - 1, v)\, \phi_t(v, u) / s(t), \qquad (6)
\]
\[
\beta(t, u) = \sum_v \beta(t + 1, v)\, \phi_t(u, v) / s(t + 1), \qquad (7)
\]
where s(t) = \sum_u \sum_v \alpha(t - 1, v)\, \phi_t(v, u). It is easy to show that Z(x) = \prod_t s(t), and now
the marginal has the simpler form p(y_t \mid x, w) = \alpha(t, y_t)\, \beta(t, y_t). Therefore, the gradient of the
marginal is
\[
\frac{\partial p(y_t \mid x, w)}{\partial w} = G_\alpha(t, y_t)\, \beta(t, y_t) + \alpha(t, y_t)\, G_\beta(t, y_t), \qquad (8)
\]
where G_\alpha and G_\beta are the gradients of the normalized forward and backward scores, respectively.
Due to score normalization, G_\alpha and G_\beta are very unlikely to overflow. We now show that G_\alpha can be
calculated recursively. Letting q(t, u) = \sum_v \alpha(t - 1, v)\, \phi_t(v, u), we have
\[
G_\alpha(t, u) = \frac{\partial \alpha(t, u)}{\partial w}
= \frac{\frac{\partial q(t, u)}{\partial w} \sum_v q(t, v) - \sum_v \frac{\partial q(t, v)}{\partial w}\, q(t, u)}{\bigl( \sum_v q(t, v) \bigr)^2}, \qquad (9)
\]
and
\[
\frac{\partial q(t, u)}{\partial w} = \sum_v G_\alpha(t - 1, v)\, \phi_t(v, u) + \sum_v \alpha(t - 1, v)\, \phi_t(v, u)\, f(v, u, x, t). \qquad (10)
\]
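The scaled recursion of equations (6) and (7) and the resulting marginals can be sketched as follows; the potential-array layout and the handling of the first frame are our assumptions, since the boundary case is not spelled out above.

```python
import numpy as np

def scaled_forward_backward(potentials, init_potentials):
    """Scaled forward-backward pass for a linear-chain CRF (eqs. 6-7).

    potentials: (T-1, S, S) array; potentials[t][v, u] = exp(w . f) linking label v
    at frame t with label u at frame t+1 (0-based frames, S labels).
    init_potentials: (S,) unnormalized state scores for the first frame.
    Returns normalized alpha, beta, scaling factors s, and marginals alpha * beta.
    """
    T = potentials.shape[0] + 1
    S = init_potentials.shape[0]
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    s = np.zeros(T)

    s[0] = init_potentials.sum()
    alpha[0] = init_potentials / s[0]
    for t in range(1, T):
        unnorm = alpha[t - 1] @ potentials[t - 1]   # sum_v alpha(t-1, v) phi_t(v, u)
        s[t] = unnorm.sum()
        alpha[t] = unnorm / s[t]

    for t in range(T - 2, -1, -1):
        beta[t] = (potentials[t] @ beta[t + 1]) / s[t + 1]

    return alpha, beta, s, alpha * beta             # marginals p(y_t | x, w) per frame
```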
Figure 1: HIT−FA results for DNN, DNN∗, DNN-CRF, and DNN-CRF∗ at −10, −5, and 0 dB. (a)-(c): matched-noise test condition; (d)-(f): unmatched-noise test condition.
Figure 2: Channelwise HIT−FA comparisons of DNN, DNN∗, DNN-CRF, and DNN-CRF∗ on 0 dB mixtures in the matched-noise test condition. (a) overall; (b) voiced speech intervals; (c) unvoiced speech intervals.
The derivation of Gβ is similar, thus omitted. The time complexity of calculating Gα and Gβ is
O(L|S|^2), where L and |S| are the utterance length and the size of the label set, respectively. This
is the same as the forward-backward recursion.
The objective function in (5) is not concave. Since high accuracy correlates with high HIT−FA, a
safe practice is to use a solution from (4) as a warm start for the subsequent optimization of (5). For
feature learning, the DNN is also trained using (5) in the final system. The gradient calculation is much
simpler due to the absence of transition features. We found that L-BFGS performs well and shows
fast and stable convergence for both feature learning and CRF training.
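A hedged sketch of this two-stage schedule using SciPy's L-BFGS implementation; the objective and gradient callables are placeholders for the negated conditional log-likelihood (4) and the negated HIT−FA objective (5).

```python
from scipy.optimize import minimize

def fit_crf(neg_log_likelihood, nll_grad, neg_hit_fa, hit_fa_grad, w0):
    """Two-stage CRF training: warm-start the non-concave HIT-FA objective."""
    # Stage 1: maximize the (concave) conditional log-likelihood from a generic init.
    stage1 = minimize(neg_log_likelihood, w0, jac=nll_grad, method="L-BFGS-B")
    # Stage 2: maximize HIT-FA, warm-started at the stage-1 solution.
    stage2 = minimize(neg_hit_fa, stage1.x, jac=hit_fa_grad, method="L-BFGS-B")
    return stage2.x
```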
4 Experimental results
4.1 Experimental setup
Our training and test sets are primarily created from the IEEE corpus [24] recorded by a single fe-
male speaker. This enables us to directly compare with previous intelligibility studies [10], where
the same speaker is used in training and testing. The training set is created by mixing 50 utter-
ances with 12 noises at 0 dB. To create the test set, we choose 20 unseen utterances from the same
speaker. First, the 20 utterances are mixed with the previous 12 noises to create a matched-noise test
condition, then with 5 unseen noises to create an unmatched-noise test condition.

Figure 3: Masks for a test utterance mixed with an unseen crowd noise at 0 dB. (a) Ideal binary mask; (b) DNN-CRF∗-P mask; (c) DNN mask. White represents 1's and black represents 0's.

The test noises (see Footnote 1) cover
a variety of daily noises and most of them are highly non-stationary. In each frequency channel,
there are roughly 150,000 and 82,000 T-F units in the training and test set, respectively. Speaker-
independent experiments are presented in Section 4.4.
The proposed system is called DNN-CRF, or DNN-CRF∗ if it is trained to maximize HIT−FA. We
use suffixes R and P to distinguish the training features for the CRF, where R stands for learned features with-
out a context window (features are learned from the complementary acoustic feature set mentioned
in Section 2) and P stands for a window of posterior features. We use a two hidden layer DNN as
it provides a good trade-off between performance and complexity, and use a context window span-
ning 5 time frames and 17 frequency channels to construct the posterior feature vector. We use the
cross-entropy objective function for training the standard DNN in comparisons.
4.2 Effects of maximizing HIT−FA
In this subsection, we show the effect of directly maximizing the HIT−FA rate. To evaluate the
contribution from the change of the objective alone, we use ideal pitch in the following experiments
to neutralize pitch estimation errors. The models are trained on 0 dB mixtures. In addition to 0 dB,
we also test the trained models on -10 and -5 dB mixtures. Such a test setting not only allows us
to measure the system’s generalization to different SNR conditions, but also to show the effects of
HIT−FA maximization on estimating sparse IBMs. We compare DNN-CRF∗ -R with DNN, DNN∗
and DNN-CRF-R, and the results are shown in Figures 1 and 2.
We document HIT−FA rates on three levels: overall, voiced intervals (pitched frames) and unvoiced
intervals (unpitched frames). Voicing boundaries are determined using ideal pitch. Figure 1 shows
the results for both matched-noise and unmatched-noise test conditions. First, comparing the perfor-
mances of DNN-CRFs and DNNs, we can see that modeling temporal continuity always improves
performance. It also seems very helpful for generalization to different SNRs. In the matched con-
dition, the improvement by directly maximizing HIT−FA is most significant in unvoiced intervals.
The improvement becomes larger when SNR decreases. In the unmatched condition, as classifica-
tion becomes much harder, direct maximization of HIT−FA offers more improvements in all cases.
The largest HIT−FA improvement of DNN-CRF∗ -R over DNN is about 10.7% and 21.2% abso-
lute in overall and unvoiced speech intervals, respectively. For a closer inspection, Figure 2 shows
channelwise HIT−FA comparisons on the 0 dB test mixtures in the matched-noise test condition. It
is well known that unvoiced speech is indispensable for speech intelligibility but hard to separate.
Due to the lack of harmonicity and weak energy, frequency channels containing unvoiced speech
often have significantly skewed distributions of target-dominant and interference-dominant units.
Therefore, an accuracy-maximizing classifier tends to output all 0’s to attain a high accuracy. As
an illustration, Figure 3 shows two masks for an utterance mixed with an unseen crowd noise at 0
dB using DNN and DNN-CRF∗ -P respectively. The two estimated masks achieve similar accuracy
around 90%. However, it is clear that the DNN mask misses significant portions of unvoiced speech,
e.g., between frames 30-50 and 220-240.
Footnote 1: Test noises are: babble, bird chirp, crow, cocktail party, yelling, clap, rain, rock music, siren, telephone,
white, wind, crowd, fan, speech shaped, traffic, and factory noise. The first 12 are used in training.
Table 1: Performance comparisons between different systems. Boldface indicates best result.

System               |       Matched-noise condition          |      Unmatched-noise condition
                     | Accuracy  HIT−FA  SNR (dB)  SegSNR (dB) | Accuracy  HIT−FA  SNR (dB)  SegSNR (dB)
GMM [10]             |  77.4%    55.4%    10.2      7.3        |  65.9%    31.6%    6.8       1.9
SVM [11]             |  86.6%    68.0%    10.5     10.9        |  91.2%    64.1%    9.7       7.9
DNN                  |  87.7%    71.6%    11.4     11.8        |  91.1%    66.2%    9.9       8.1
CRF                  |  82.3%    59.8%     8.8      8.7        |  90.8%    64.0%    9.3       7.8
SVM-Struct           |  81.7%    58.6%     8.4      8.1        |  90.7%    63.5%    9.1       7.5
CNF                  |  87.8%    71.7%    11.2     12.0        |  91.1%    66.9%    9.8       8.4
LD-CRF               |  86.3%    68.4%     9.7     10.5        |  91.1%    63.6%    8.9       7.8
DNN-CRF∗-R           |  89.1%    75.6%    12.1     13.2        |  90.8%    70.2%   10.3       9.0
DNN-CRF∗-P           |  89.9%    76.9%    12.0     13.5        |  91.1%    70.7%   10.0       8.9
Hendriks et al. [1]  |   n/a      n/a      4.6      0.5        |   n/a      n/a     6.2       1.1
Wiener Filter [2]    |   n/a      n/a      3.7     -0.7        |   n/a      n/a     5.6      -0.6
4.3 Comparisons with other systems
We systematically compare the proposed system with three kinds of systems on 0 dB mixtures:
binary classifier based, structured predictor based, and speech enhancement based. In addition to
HIT−FA, we also include classification accuracy, SNR and segmental SNR (segSNR) as alterna-
tive evaluation criteria. To compute SNRs, we use the target speech resynthesized from the IBM as
the ground truth signal for all classification-based systems. This way of computing SNRs is com-
monly adopted in the literature [4, 25], as the IBM represents the ground truth of classification. All
classification-based systems use the same feature set described in Section 2 (but with estimated pitch),
except for Kim et al.'s GMM-based system, which uses AMS features [10]. Note that we failed
to produce reasonable results using the complementary feature set in Kim et al.'s system, possibly
because GMM requires much more training data than discriminative models for high dimensional
features. Results are summarized in Table 1.
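For reference, SNR and segmental SNR can be computed along the lines below; the frame length and the [-10, 35] dB clipping range are common conventions rather than values stated in the paper, and for the classification-based systems the reference is the IBM-resynthesized target.

```python
import numpy as np

def snr_db(reference, estimate):
    """Overall SNR: reference energy over error energy, in dB."""
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(error ** 2) + 1e-12))

def segmental_snr_db(reference, estimate, frame_len=320, lo=-10.0, hi=35.0):
    """Segmental SNR: mean of per-frame SNRs, clipped to a typical dB range."""
    snrs = []
    for start in range(0, len(reference) - frame_len + 1, frame_len):
        ref = reference[start:start + frame_len]
        err = ref - estimate[start:start + frame_len]
        snrs.append(10.0 * np.log10(np.sum(ref ** 2) / (np.sum(err ** 2) + 1e-12)))
    return float(np.mean(np.clip(snrs, lo, hi)))
```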
We first compare with methods based on binary classifiers. These include two existing systems
[10, 11] and a DNN based system. Due to the variety of noises, classification is challenging even
in the matched-noise condition. It is clear that the proposed system significantly outperforms the
others in terms of all criteria. The improvement of DNN-CRF∗ s over DNN demonstrates the benefit
of modeling temporal continuity. It is interesting to see that DNN significantly outperforms SVM,
especially for unvoiced speech (not shown) which is important for speech intelligibility. We note
that without RBM pretraining, DNN performs significantly worse. Classification in the unmatched-
noise condition is obviously more difficult, as feature distributions are likely mismatched between
the training and the test set. Kim et al.’s system fails to generalize to different acoustic environments
due to substantially increased FA rates. The proposed system significantly outperforms SVM and
DNN, achieving about 71% overall HIT−FA and 10 dB SNR for unseen noises. Kim et al.’s system
has been shown to improve human speech intelligibility [10]; it is therefore reasonable to project
that the proposed system will provide further speech intelligibility improvements.
We next compare with systems based on structured predictors, including CRF, SVM-Struct [26],
conditional neural fields (CNF) [20] and latent-dynamic CRF (LD-CRF) [19]. For fair compar-
isons, we use a two hidden layer CNF model with the same number of parameters as DNN-CRF∗ s.
Conventional structured predictors such as CRF and SVM-Struct (linear kernel) are able to explic-
itly model temporal dynamics, but only with linear modeling capability. Direct use of CRF turns
out to be much worse than using kernel SVM. Nevertheless, the performance can be substantially
boosted by adding latent variables (LD-CRF) or by using nonlinear feature functions (CNF and
DNN-CRF∗ s). With the same network architecture, CNF mainly differs from our model in two as-
pects. First, CNF does not use unsupervised RBM pretraining. Second, CNF only uses bias units in
building transition features. As a result, the proposed system significantly outperforms CNF, even
if CRF and neural networks are jointly trained in the CNF model. With a better ability to encode
contextual information, using a window of posteriors as features clearly outperforms single-unit
features in terms of classification. It is worth noting that although SVM achieves slightly higher
accuracy in the unmatched-noise condition, the resulting HIT−FA and SNRs are worse than some
other systems. This is consistent with our analysis in Section 4.2.
Finally, we compare with two representative speech enhancement systems [1, 2]. The algorithm
proposed in [1] represents a recent state-of-the-art method and Wiener filtering [2] is one of the most
widely used speech enhancement algorithms. Since speech enhancement does not aim to estimate
the IBM, we compare SNRs by using clean speech (not the IBM) as the ground truth. As shown in
Table 1, the speech enhancement algorithms are much worse, and this is true of all 17 noises.
Due to temporal continuity modeling and the use of T-F context, the proposed system produces
masks that are smoother than those from the other systems (e.g., Figure 3). As a result, the outputs
seem to contain less musical noise.
4.4 Generalization to unseen speakers
Although the training set contains only a single IEEE speaker, the proposed system generalizes
reasonably well to different unseen speakers. To show this, we create a new test set by mixing 20
utterances from the TIMIT corpus [27] at 0 dB. The new test utterances are chosen from 10 different
female TIMIT speakers, each providing 2 utterances. We show the results in Table 2, and it is
clear that the proposed system generalizes better than existing ones to unseen speakers. Note that
significantly better performance and generalization to different genders can be obtained by including
the speaker(s) of interest into the training set.
4.5 Discussion
Listening tests have shown that a high FA rate is more detrimental to speech intelligibility than a
high miss (or low HIT) [9]. The proposed classification framework affords us control over these two
quantities. For example, we could constrain the upper bound of the FA rate while still maximizing
the HIT rate. In this case, a constrained optimization problem would replace (5). Our experimental results
(not shown due to lack of space) indicate that this can effectively remove spurious target segments
while still producing intelligible speech.
Since the derivative of the marginals can be computed efficiently, in principle one could optimize a
class of objectives other than HIT−FA. These may include objectives concerning either speech in-
telligibility or quality, as long as the objective of interest can be expressed or approximated by a
combination of marginal probabilities. For example, we have tried to simultaneously minimize two
traditional CASA measures PEL and PN R (see e.g., [25]), where PEL represents the percent of tar-
get energy loss and PN R the percent of noise energy residue. Significant reductions in both measures
can be achieved compared to methods that maximize accuracy or conditional log-likelihood.
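A simplified reading of these two measures, computed on per-unit mixture energies rather than on resynthesized waveforms as in [25], could look like the following sketch.

```python
import numpy as np

def pel_pnr(estimated_mask, ideal_mask, mixture_energy):
    """Simplified P_EL / P_NR in the spirit of [25].

    P_EL: fraction of the IBM-retained mixture energy missed by the estimated mask.
    P_NR: fraction of the estimate-retained mixture energy lying outside the IBM.
    """
    est = estimated_mask.astype(bool)
    ibm = ideal_mask.astype(bool)
    p_el = mixture_energy[ibm & ~est].sum() / (mixture_energy[ibm].sum() + 1e-12)
    p_nr = mixture_energy[est & ~ibm].sum() / (mixture_energy[est].sum() + 1e-12)
    return p_el, p_nr
```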
5 Concluding remarks
We have demonstrated that the challenge of the monaural speech separation problem can be ef-
fectively approached via structured prediction. Observing that the IBM exhibits highly structured
patterns, we have proposed to use CRF to explicitly model the temporal continuity in the IBM. This
linear sequence classifier is further transformed to a nonlinear one by using state and transition fea-
ture functions learned from DNN. Consistent with the results from speech perception, we train the
proposed DNN-CRF model to maximize a measure that is well correlated with human speech intel-
ligibility in noise. Experimental results show that the proposed system significantly outperforms
existing ones and generalizes better to different acoustic environments. Aside from temporal con-
tinuity, other ASA principles [5] such as common onset and co-modulation also contribute to the
structure in the IBM, and we will investigate these in future work.
Acknowledgements. This research was supported in part by an AFOSR grant (FA9550-12-1-0130), an STTR
subcontract from Kuzer, and the Ohio Supercomputer Center.
References
[1] R. Hendriks, R. Heusdens, and J. Jensen, “MMSE based noise PSD tracking with low complexity,” in
ICASSP, 2010.
[2] P. Scalart and J. Filho, “Speech enhancement based on a priori signal to noise estimation,” in ICASSP,
1996.
[3] S. Roweis, “One microphone source separation,” in NIPS, 2001.
[4] D. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Ap-
plications. Hoboken, NJ: Wiley-IEEE Press, 2006.
[5] A.S. Bregman, Auditory scene analysis: The perceptual organization of sound. The MIT Press, 1994.
[6] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Sepa-
ration by Humans and Machines, Divenyi P., Ed. Kluwer Academic, Norwell MA., 2005, pp. 181–197.
[7] D. Brungart, P. Chang, B. Simpson, and D. Wang, “Isolating the energetic component of speech-on-speech
masking with ideal time-frequency segregation,” J. Acoust. Soc. Am., vol. 120, pp. 4007–4018, 2006.
[8] M. Anzalone, L. Calandruccio, K. Doherty, and L. Carney, “Determination of the potential benefit of
time-frequency gain manipulation,” Ear and hearing, vol. 27, no. 5, pp. 480–492, 2006.
[9] N. Li and P. Loizou, “Factors influencing intelligibility of ideal binary-masked speech: Implications for
noise reduction,” J. Acoust. Soc. Am., vol. 123, no. 3, pp. 1673–1682, 2008.
[10] G. Kim, Y. Lu, Y. Hu, and P. Loizou, “An algorithm that improves speech intelligibility in noise for
normal-hearing listeners,” J. Acoust. Soc. Am., vol. 126, pp. 1486–1494, 2009.
[11] K. Han and D. Wang, “An SVM based classification approach to speech separation,” in ICASSP, 2011.
[12] Y. Wang, K. Han, and D. Wang, “Exploring monaural features for classification-based speech segrega-
tion,” IEEE Trans. Audio, Speech, Lang. Process., in press, 2012.
[13] G. Mysore and P. Smaragdis, “A non-negative approach to semi-supervised separation of speech from
noise with the use of temporal dynamics,” in ICASSP, 2011.
[14] J. Hershey, T. Kristjansson, S. Rennie, and P. Olsen, “Single channel speech separation using factorial
dynamics,” in NIPS, 2007.
[15] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: probabilistic models for segmenting
and labeling sequence data,” in ICML, 2001.
[16] J. Nocedal and S. Wright, Numerical optimization. Springer-Verlag, 1999.
[17] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation,
vol. 18, no. 7, pp. 1527–1554, 2006.
[18] L. van der Maaten, M. Welling, and L. Saul, “Hidden-unit conditional random fields,” in AISTATS, 2011.
[19] L. Morency, A. Quattoni, and T. Darrell, “Latent-dynamic discriminative models for continuous gesture
recognition,” in CVPR, 2007.
[20] J. Peng, L. Bo, and J. Xu, “Conditional neural fields,” in NIPS, 2009.
[21] A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” in NIPS workshop
on speech recognition and related applications, 2009.
[22] T. Do and T. Artieres, “Neural conditional random fields,” in AISTATS, 2010.
[23] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc.
IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[24] IEEE, “IEEE recommended practice for speech quality measurements,” IEEE Trans. Audio Electroa-
coust., vol. 17, pp. 225–246, 1969.
[25] G. Hu and D. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,”
IEEE Trans. Neural Networks, vol. 15, no. 5, pp. 1135–1150, 2004.
[26] I. Tsochantaridis, T. Hofmann, and T. Joachims, “Support vector machine learning for interdependent and structured
output spaces,” in ICML, 2004.
[27] J. Garofolo, DARPA TIMIT acoustic-phonetic continuous speech corpus, NIST, 1993.