Detection and Classification of Acoustic Scenes and Events 2016 3 September 2016, Budapest, Hungary

BIDIRECTIONAL LSTM-HMM HYBRID SYSTEM FOR POLYPHONIC SOUND EVENT DETECTION

Tomoki Hayashi1, Shinji Watanabe2, Tomoki Toda1, Takaaki Hori2, Jonathan Le Roux2, Kazuya Takeda1
1 Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8603, Japan
2 Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, Cambridge, MA 02139, USA
hayashi.tomoki@g.sp.m.is.nagoya-u.ac.jp, {takeda,tomoki}@is.nagoya-u.ac.jp, {watanabe,thori,leroux}@merl.com

ABSTRACT

In this study, we propose a new method of polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM). We extend the hybrid model of neural network and HMM, which achieved state-of-the-art performance in the field of speech recognition, to the multi-label classification problem. This extension provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNN. We compare the performance of our proposed method to conventional methods such as non-negative matrix factorization (NMF) and standard BLSTM-RNN, using the DCASE2016 task 2 dataset. Our proposed method outperformed conventional approaches in both monophonic and polyphonic tasks, and finally achieved an average F1-score of 67.1% (error rate of 64.5%) on the event-based evaluation, and an average F1-score of 76.0% (error rate of 50.0%) on the segment-based evaluation.

Index Terms— Polyphonic Sound Event Detection, Bidirectional Long Short-Term Memory, Hidden Markov Model, multi-label classification

1. INTRODUCTION

Sounds carry important information for various applications such as life-logging, environmental context understanding, and monitoring systems. To realize these applications, it is necessary to automatically extract internal information not only from speech and music, which have been studied for a long time, but also from various other types of sounds.

Recently, studies on sound event detection (SED) have attracted much interest, aiming at the understanding of various sounds. The objective of SED systems is to identify the beginning and end of sound events and to identify and label these sounds. SED is divided into two scenarios, monophonic and polyphonic. Monophonic sound event detection operates under the restriction that only one event is active at a time. In polyphonic sound event detection, on the other hand, the number of simultaneously active events is unknown. Polyphonic SED is a more realistic task than monophonic SED because, in real situations, several sound events are likely to happen simultaneously or to overlap.

The most typical approach to SED is to use a Hidden Markov Model (HMM), where the emission probability distribution is represented by Gaussian Mixture Models (GMM-HMM), with Mel Frequency Cepstral Coefficients (MFCCs) as features [1, 2]. Another approach is to utilize Non-negative Matrix Factorization (NMF) [3, 4, 5]. In the NMF approaches, a dictionary of basis vectors is learned by decomposing the spectrum of each single sound event into the product of a basis matrix and an activation matrix, then combining the basis matrices. The activation matrix at test time is estimated using the basis vector dictionary. More recently, methods based on neural networks have achieved good performance for sound event classification and detection using acoustic signals [7, 8, 9, 10, 11, 12]. In the first two of these studies [7, 8], the network was trained to deal with a multi-label classification problem for polyphonic sound event detection. Although these networks provide good performance, they do not have an explicit duration model for the output label sequence, and the actual output needs to be smoothed with careful thresholding to achieve the best performance.

In this paper, we propose a new polyphonic sound event detection method based on a hybrid system of bidirectional long short-term memory recurrent neural network and HMM (BLSTM-HMM). The proposed hybrid system is inspired by the BLSTM-HMM hybrid system used in speech recognition [13, 14, 15, 16], where the output duration is controlled by an HMM on top of a BLSTM network. We extend the hybrid system to polyphonic SED, and more generally to the multi-label classification problem. Our approach allows the smoothing of the frame-wise outputs without post-processing and does not require thresholding.

The rest of this paper is organized as follows: Section 2 presents various types of recurrent neural networks and the concept of long short-term memory. Section 3 describes our proposed method in detail. Section 4 describes the design of our experiment and evaluates the performance of the proposed method and conventional methods. Finally, we conclude this paper and discuss future work in Section 5.

2. RECURRENT NEURAL NETWORKS

2.1. Recurrent Neural Network

A Recurrent Neural Network (RNN) is a layered neural network which has a feedback structure. The structure of a simple RNN is shown in Fig. 1. In comparison to feed-forward layered neural networks, RNNs can propagate prior time information forward to the current time, enabling them to understand context information in a sequence of feature vectors. In other words, the hidden layer of an RNN serves as a memory function.

Figure 1: Recurrent Neural Network

An RNN can be described mathematically as follows. Let us denote a sequence of feature vectors as {x_1, x_2, ..., x_T}. For an RNN, the hidden layer output vector h_t and the output layer vector y_t are
calculated as follows:

    h_t = f(W_1 x_t + W_r h_{t-1} + b_1),    (1)
    y_t = g(W_2 h_t + b_2),    (2)

where W_i and b_i represent the input weight matrix and bias vector of the i-th layer, respectively, W_r represents a recurrent weight matrix, and f and g represent the activation functions of the hidden layer and output layer, respectively.
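As a concrete illustration of Eqs. 1 and 2, the following is a minimal NumPy sketch of a forward pass over a feature sequence. The choice of tanh for f and softmax for g, and all variable names, are assumptions made for the example; the equations above do not fix them.

    import numpy as np

    def rnn_forward(x_seq, W1, Wr, b1, W2, b2):
        """Forward pass of a simple RNN over a sequence of feature vectors."""
        h = np.zeros(W1.shape[0])                # initial hidden state h_0
        hs, ys = [], []
        for x_t in x_seq:
            # Eq. 1: hidden state from current input and previous hidden state
            h = np.tanh(W1 @ x_t + Wr @ h + b1)
            # Eq. 2: output layer; softmax assumed for g
            a = W2 @ h + b2
            y = np.exp(a - a.max())
            ys.append(y / y.sum())
            hs.append(h)
        return np.stack(hs), np.stack(ys)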
2.2. Bidirectional Recurrent Neural Network

A Bidirectional Recurrent Neural Network (BRNN) [13, 17] is a layered neural network which has feedback not only from the previous time period, but also from the following time period. The structure of a BRNN is shown in Fig. 2. The hidden layer which connects to the following time period is called the forward layer, while the layer which connects to the previous time period is called the backward layer. Compared with conventional RNNs, BRNNs can propagate information not only from the past but also from the future, and therefore have the ability to understand and exploit the full context in an input sequence.

Figure 2: Bidirectional Recurrent Neural Network

2.3. Long Short-Term Memory RNNs

One major problem with RNNs is that they cannot learn context information over long stretches of time because of the so-called vanishing gradient problem [19]. One effective solution to this problem is to use Long Short-Term Memory (LSTM) architectures [20, 21]. LSTM architectures prevent vanishing gradient issues and allow the memorization of long-term context information. As illustrated in Fig. 3, LSTM layers are characterized by a memory cell s_t and three gates: 1) an input gate g_t^I, 2) a forget gate g_t^F, and 3) an output gate g_t^O. Each gate g_* has a value between 0 and 1. The value 0 means that the gate is closed, while the value 1 means that the gate is open. In an LSTM layer, the hidden layer output h_t in Eq. 1 is replaced by the following equations:

    g_t^I = σ(W_I x_t + W_{rI} h_{t-1} + s_{t-1}),    (3)
    g_t^F = σ(W_F x_t + W_{rF} h_{t-1} + s_{t-1}),    (4)
    s_t = g_t^I ⊙ f(W_1 x_t + W_r h_{t-1} + b_1) + g_t^F ⊙ s_{t-1},    (5)
    g_t^O = σ(W_O x_t + W_{rO} h_{t-1} + s_{t-1}),    (6)
    h_t = g_t^O ⊙ tanh(s_t),    (7)

where the W and W_r denote input weight matrices and recurrent weight matrices, respectively, the subscripts I, F, and O represent the input, forget, and output gates, respectively, ⊙ represents point-wise multiplication, and σ represents the logistic sigmoid function.

Figure 3: Long Short-Term Memory
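A minimal sketch of one LSTM time step that follows Eqs. 3-7 literally, assuming f = tanh; the weight dictionary and the additive use of the previous cell state in the gate inputs are taken directly from the equations above (practical implementations usually scale these cell-state terms with peephole weights).

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, s_prev, W):
        """One LSTM step (Eqs. 3-7). W maps names like 'I', 'rI', ... to weights."""
        g_i = sigmoid(W['I'] @ x_t + W['rI'] @ h_prev + s_prev)   # Eq. 3: input gate
        g_f = sigmoid(W['F'] @ x_t + W['rF'] @ h_prev + s_prev)   # Eq. 4: forget gate
        cand = np.tanh(W['1'] @ x_t + W['r'] @ h_prev + W['b1'])  # cell input (f in Eq. 5)
        s_t = g_i * cand + g_f * s_prev                           # Eq. 5: memory cell
        g_o = sigmoid(W['O'] @ x_t + W['rO'] @ h_prev + s_prev)   # Eq. 6: output gate
        h_t = g_o * np.tanh(s_t)                                  # Eq. 7: hidden output
        return h_t, s_t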

2.4. Projection Layer

Use of a projection layer is a technique which reduces the computational complexity of deep recurrent network structures and allows the creation of very deep LSTM networks [14, 15]. The architecture of an LSTM-RNN with a projection layer (LSTMP-RNN) is shown in Fig. 4. The projection layer, which is a linear transformation layer, is inserted after an LSTM layer, and the projection layer output is fed back to the LSTM layer. With the insertion of the projection layer, the hidden layer output h_{t-1} in Eqs. 3-6 is replaced with p_{t-1}, and the following equation is added:

    p_t = W_l h_t,    (8)

where W_l represents a projection weight matrix, and p_t represents the projection layer output.

Figure 4: Long Short-Term Memory Recurrent Neural Network with Projection Layer
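A short sketch of how the projection layer modifies the recurrence: h_{t-1} in the gate and cell equations is replaced by the projected output p_{t-1}, and Eq. 8 maps h_t to p_t. The function reuses the lstm_step sketch from Section 2.3; W_l and the function name are illustrative.

    def lstmp_step(x_t, p_prev, s_prev, W, W_l):
        """One LSTMP step: feed back the projected output instead of h_t."""
        h_t, s_t = lstm_step(x_t, p_prev, s_prev, W)  # h_{t-1} -> p_{t-1} in Eqs. 3-6
        p_t = W_l @ h_t                               # Eq. 8: linear projection
        return p_t, s_t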

3. PROPOSED METHOD

3.1. Data generation

There are only 20 clean samples per sound event in the DCASE2016 task 2 training dataset. Since this is not enough data to train a deeply structured recurrent neural network, we synthetically generated our own training data from the provided data. The training data generation procedure is as follows: 1) generate a silence signal of a predetermined length, 2) randomly select a sound event from the training dataset, 3) add the selected sound event to the generated silence signal at a predetermined location, 4) repeat Steps 2 and 3 a predetermined number of times, and 5) add a background noise signal extracted from the development set at a predetermined signal-to-noise ratio (SNR).

In this data generation operation, there are four hyper-parameters: the signal length, the number of events in a signal, the number of overlaps, and the SNR between sound events and background noise. We set the signal length to 4 seconds, the number of events to a value from 3 to 5, the number of overlaps to a value from 1 to 5, and the SNR to a value from -9 dB to 9 dB. We then generated 100,000 training samples of 4 seconds length, hence about 111 hours of training data.
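A simplified sketch of this generation loop under the hyper-parameters above. The function name, the event/noise arrays, and the SNR scaling are illustrative assumptions, and the control of the number of overlaps is omitted for brevity.

    import numpy as np

    def generate_sample(event_waves, noise, fs=44100, length_sec=4.0,
                        n_events=(3, 5), snr_db=(-9.0, 9.0)):
        """Synthesize one training signal: silence + random events + background noise."""
        sig = np.zeros(int(length_sec * fs))                              # 1) silence signal
        for _ in range(np.random.randint(n_events[0], n_events[1] + 1)):  # 4) repeat steps 2-3
            ev = event_waves[np.random.randint(len(event_waves))]         # 2) pick an event
            start = np.random.randint(0, len(sig) - len(ev))              # 3) place it
            sig[start:start + len(ev)] += ev
        snr = np.random.uniform(*snr_db)                                  # 5) add noise at SNR
        noise = noise[:len(sig)]
        gain = np.sqrt((sig ** 2).mean() /
                       ((noise ** 2).mean() * 10 ** (snr / 10) + 1e-12))
        return sig + gain * noise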
3.2. Feature extraction

First, we modified the amplitude of the input sound signals to adjust for the differences in recording conditions by normalizing the signals using the maximum amplitude of the input sound signals. Second, the input signal was divided into 25 ms windows with a 40% overlap, and we calculated a log filterbank feature for each window in 100 Mel bands (more bands than usual since high frequency components are more important than low frequency ones for SED). Finally, we conducted cepstral mean normalization (CMN) for each piece of training data. Feature vectors were calculated using HTK [22].
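A rough librosa-based equivalent of this pipeline (the authors used HTK [22], so this is only an approximation): the 25 ms window, 10 ms shift from Table 1, 100 mel bands, and per-file mean normalization follow the description above, but no attempt is made to match HTK parameters exactly.

    import numpy as np
    import librosa

    def extract_features(path, sr=44100, n_mels=100, win_ms=25, hop_ms=10):
        """Peak-normalize, compute 100-band log mel filterbank features, apply CMN."""
        y, sr = librosa.load(path, sr=sr)
        y = y / (np.max(np.abs(y)) + 1e-12)                  # amplitude normalization
        n_fft = int(sr * win_ms / 1000)                      # 25 ms window
        hop = int(sr * hop_ms / 1000)                        # 10 ms shift
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        logmel = np.log(mel + 1e-10).T                       # (frames, 100)
        return logmel - logmel.mean(axis=0, keepdims=True)   # per-file mean normalization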

3.3. Model

We extended the hybrid HMM/neural network model in order to handle the multi-label classification problem. To do this, we built a three-state left-to-right HMM with an additional non-active state for each sound event. The structure of our HMM is shown in Fig. 5, where n = 0, n = 5 and n = 4 represent the initial state, final state, and non-active state, respectively. Notice that the non-active state represents not only the case where there is no active event, but also the case where other events are active. Therefore, the non-active state of each sound event HMM has a different meaning from silence. In this study, we fix all transition probabilities to a constant value of 0.5.

Figure 5: Hidden Markov Model of each sound event
Using Bayes' theorem, the HMM state emission probability P(x_t | s_{c,t} = n) can be approximated as follows:

    P(x_t | s_{c,t} = n) = P(s_{c,t} = n | x_t) P(x_t) / P(s_{c,t} = n)
                         ≃ P(s_{c,t} = n | x_t) / P(s_{c,t} = n),    (9)

where c ∈ {1, 2, ..., C} represents the index of sound events and n ∈ {1, 2, ..., N} represents the index of HMM states; hence, P(s_{c,t} = n | x_t) satisfies the sum-to-one condition Σ_n P(s_{c,t} = n | x_t) = 1. In the BLSTM-HMM hybrid model, the HMM state posterior P(s_{c,t} = n | x_t) is calculated using a BLSTM-RNN. The structure of the network is shown in Fig. 6. This network has three hidden layers, each of which consists of an LSTM layer and a projection layer, and the number of output layer nodes is C × N. The posterior P(s_{c,t} | x_t) satisfies the sum-to-one condition for each sound event c at frame t, and it is obtained by the following softmax operation:

    P(s_{c,t} = n | x_t) = exp(a_{c,n,t}) / Σ_{n'=1}^{N} exp(a_{c,n',t}),    (10)

where a represents the activation of an output layer node. The network was optimized using back-propagation through time (BPTT) with Stochastic Gradient Descent (SGD) and dropout under the cross-entropy objective function for multi-class, multi-label classification:

    E(Θ) = − Σ_{c=1}^{C} Σ_{n=1}^{N} Σ_{t=1}^{T} y_{c,n,t} ln P(s_{c,t} = n | x_t),    (11)

where Θ represents the set of network parameters, and y_{c,n,t} is the HMM state label obtained from the maximum likelihood path at frame t. (Note that this is not the same as the multi-class objective function in conventional DNN-HMM.)

Figure 6: Proposed model structure
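A small sketch of how the C × N output activations map to the per-event posteriors of Eq. 10 and how the objective of Eq. 11 is computed from one-hot state labels; the array shapes and names are illustrative assumptions.

    import numpy as np

    def state_posteriors(act):
        """Eq. 10: softmax over the N states of each event class.
        act: output-layer activations of shape (T, C, N)."""
        e = np.exp(act - act.max(axis=2, keepdims=True))   # numerical stability
        return e / e.sum(axis=2, keepdims=True)             # sums to 1 over states

    def objective(post, labels):
        """Eq. 11: cross-entropy summed over classes, states, and frames.
        labels: one-hot y_{c,n,t} of shape (T, C, N) from the ML path."""
        return -np.sum(labels * np.log(post + 1e-10))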
The HMM state prior P(s_{c,t}) is calculated by counting the number of occurrences of each HMM state. However, in this study, because our synthetic training data does not represent the actual sound event occurrences, the prior obtained from the occurrences of HMM states has to be made less sensitive. Therefore, we smoothed P(s_{c,t}) as follows:

    P̂(s_{c,t}) = P(s_{c,t})^α,    (12)

where α is a smoothing coefficient. In this study, we set α = 0.01. Finally, we calculated the HMM state emission probability using Eq. 9 and obtained the maximum likelihood path using the Viterbi algorithm.
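The following sketch combines Eqs. 9 and 12 with Viterbi decoding for a single event class, working in the log domain. The transition matrix would encode the left-to-right structure of Fig. 5 with the constant probability of 0.5 used in this work, but its construction and the handling of the non-active state are assumptions left out here; the renormalization of the smoothed prior is also an added convenience not stated in Eq. 12.

    import numpy as np

    def smooth_log_prior(counts, alpha=0.01):
        """Eq. 12: flatten the state prior, P_hat(s) = P(s)**alpha (renormalized here)."""
        p = counts / counts.sum()
        p = p ** alpha
        return np.log(p / p.sum())

    def viterbi_ml_path(log_post, log_prior, log_trans):
        """Maximum likelihood state path for one event class.
        log_post: (T, N) log P(s_t = n | x_t); log_prior: (N,); log_trans: (N, N)."""
        log_emit = log_post - log_prior[None, :]      # Eq. 9: posterior divided by prior
        T, N = log_emit.shape
        delta = np.empty((T, N))
        back = np.zeros((T, N), dtype=int)
        delta[0] = log_emit[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_trans
            back[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + log_emit[t]
        path = np.empty(T, dtype=int)
        path[-1] = delta[-1].argmax()
        for t in range(T - 2, -1, -1):                # backtrack
            path[t] = back[t + 1, path[t + 1]]
        return path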

4. EXPERIMENTS

4.1. Experimental condition

We evaluated our proposed method using the DCASE2016 task 2 dataset [18, 6]. In this study, we randomly selected 5 samples per event from the training data and generated 18 samples of 120 sec length, in the same manner as the DCASE2016 task 2 development set, using the selected samples. These generated samples are used as the development set for the open condition evaluation, and the remaining 15 samples
per class are used for training. Evaluation is conducted using two metrics, event-based evaluation and segment-based evaluation, where F1-score (F1) and error rate (ER) are utilized as the evaluation criteria (see [24] for more details).

Table 1: Experimental conditions
  Sampling rate            44,100 Hz
  Frame size               25 ms
  Shift size               10 ms
  Learning rate            0.0005
  Initial scale            0.001
  Gradient clipping norm   5
  Batch size               64
  Time steps               400
  Epochs                   20

We built our proposed model using the following procedure: 1) divide each active event into three segments of equal length in order to assign left-to-right HMM state labels, 2) train the BLSTM-RNN using these HMM state labels as supervised data, 3) calculate the maximum likelihood path with the Viterbi algorithm using the RNN output posteriors, 4) train the BLSTM-RNN using the obtained maximum likelihood path as supervised data, and 5) repeat steps 3 and 4. In this study, when calculating the maximum likelihood path, we fixed the alignment of the non-active states, i.e., we only aligned the active HMM states of each event. When training the networks, we checked the error on the development data every epoch, and if the error became larger than in the previous epoch, we restored the parameters of the previous epoch and re-trained the network with a halved learning rate. All networks were trained using the open source toolkit TensorFlow [23] with a single GPU (Nvidia Titan X). Details of the experimental conditions are shown in Table 1.
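Step 1 of this procedure can be sketched as follows for a single event class: each annotated active region is split into three equal parts mapped to the left-to-right HMM states, while all other frames are assigned to the non-active state. The function name and the label convention (0 = non-active) are assumptions.

    import numpy as np

    def initial_state_labels(active, n_states=3):
        """active: boolean array over frames. Returns 0 for non-active frames,
        1..n_states for the left-to-right HMM states of each active region."""
        labels = np.zeros(len(active), dtype=int)
        t = 0
        while t < len(active):
            if active[t]:
                start = t
                while t < len(active) and active[t]:
                    t += 1
                parts = np.array_split(np.arange(start, t), n_states)  # equal intervals
                for i, idx in enumerate(parts):
                    labels[idx] = i + 1
            else:
                t += 1
        return labels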
4.2. Comparison with conventional methods

To confirm the performance of our proposed method, we compared it with the following four methods: 1) NMF (the DCASE2016 task 2 baseline), 2) BLSTM-RNN, 3) BLSTM-RNN disregarding a few missing frames, and 4) BLSTM-RNN with median filter smoothing. The NMF model is trained on the remaining 15 samples per class using the DCASE2016 task 2 baseline script; we do not change any settings except for the number of training samples. The BLSTM-RNN has the same network structure as the BLSTM-HMM, with the exception that the number of output layer nodes, which use a sigmoid activation function, corresponds to the number of sound events C. Each node conducts a binary classification, hence each output node y_c is between 0 and 1. We set the threshold to 0.5, i.e., y_c > 0.5 represents sound event c being active, and y_c ≤ 0.5 non-active. For post-processing, we applied two methods: median filtering and disregarding a few missing frames. In this step, we set the degree of median filtering to 9 and the number of disregarded frames to 10.
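For reference, a sketch of the thresholding and the two post-processing variants applied to the baseline BLSTM-RNN outputs of one event class; the gap-filling rule for "disregarding a few missing frames" is our reading of the description above, and its exact definition is an assumption.

    import numpy as np
    from scipy.signal import medfilt

    def binarize(posterior, threshold=0.5):
        """Frame-wise decision: y_c > 0.5 means the event is active."""
        return posterior > threshold

    def median_smooth(activity, width=9):
        """Median filtering of the binary activity track (degree 9)."""
        return medfilt(activity.astype(float), kernel_size=width) > 0.5

    def fill_short_gaps(activity, max_gap=10):
        """Disregard up to 10 missing frames between two active regions."""
        act = activity.copy()
        t = 0
        while t < len(act):
            if not act[t]:
                start = t
                while t < len(act) and not act[t]:
                    t += 1
                if 0 < start and t < len(act) and (t - start) <= max_gap:
                    act[start:t] = True
            else:
                t += 1
        return act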
Experimental results are shown in Table 2. Note that the results on the test set are provided by the DCASE2016 organizers [18]. From the results, we can see that the methods based on BLSTM are significantly better than the NMF-based method in polyphonic sound event detection. As regards post-processing, the authors of [8] reported that they did not require post-processing since the RNN outputs had already been smoothed. However, we confirmed that post-processing is still effective, especially for event-based evaluation. In addition, although RNN outputs are smoother than the outputs of neural networks without a recurrent structure, there is still room for improvement by smoothing RNN outputs. Our proposed method achieved the best performance on the development set for event-based evaluation, which supports this assertion.

Table 2: Experimental results
                       Event-based (dev / test)      Segment-based (dev / test)
                       F1 [%]        ER [%]          F1 [%]        ER [%]
NMF (Baseline)         14.6 / 24.2   665.4 / 168.5   35.9 / 37.0   183.4 / 89.3
BLSTM                  66.5 / -      85.3 / -        87.0 / -      25.9 / -
BLSTM (w/ disregard)   75.9 / -      52.5 / -        87.0 / -      25.9 / -
BLSTM (w/ median)      75.8 / 68.2   53.2 / 60.0     87.7 / 78.1   24.2 / 40.8
BLSTM-HMM              76.6 / 67.1   51.1 / 64.4     87.2 / 76.0   25.9 / 50.0

4.3. Analysis

In this section, we focus on the factors which influenced the performance of our proposed method on the development set. The first factor is the SNR between the background noise and the events. The performance of our proposed method on the development set for each SNR condition is shown in Table 3. From these results, there are clear differences in performance between the different SNR conditions. This is because loud background noise caused more insertion errors, especially for low-loudness events such as doorslam and pageturn.

Table 3: Effect of background noise
            Event-based           Segment-based
EBR [dB]    F1 [%]    ER [%]      F1 [%]    ER [%]
-6          73.7      58.0        86.0      28.0
0           76.7      51.3        87.4      25.9
6           79.6      44.1        88.1      23.9

The second factor is the difference in performance between the monophonic and polyphonic tasks. The performance of our proposed method on each type of task is shown in Table 4. In general, the polyphonic task is more difficult than the monophonic task. However, we observed better scores in the polyphonic task than in the monophonic task, while the opposite would normally be expected. We will investigate the reason in future work.

Table 4: Difference in performance between the monophonic and polyphonic tasks
             Event-based           Segment-based
             F1 [%]    ER [%]      F1 [%]    ER [%]
Monophonic   76.2      54.0        84.3      32.4
Polyphonic   76.9      49.6        88.7      22.5

5. CONCLUSION

We proposed a new method of polyphonic sound event detection based on a Bidirectional Long Short-Term Memory Hidden Markov Model hybrid system (BLSTM-HMM), and applied it to the DCASE2016 challenge task 2. We compared our proposed method to the baseline non-negative matrix factorization (NMF) and standard BLSTM-RNN methods. Our proposed method outperformed them in both monophonic and polyphonic tasks, and finally achieved an average F1-score of 67.1% (error rate of 64.5%) on the event-based evaluation, and an average F1-score of 76.0% (error rate of 50.0%) on the segment-based evaluation.

In future work, we will investigate the reason for the counter-intuitive difference between the monophonic and polyphonic tasks, examine the use of sequence discriminative training for BLSTM-HMM, and apply our proposed method to a real-recording dataset.

6. REFERENCES

[1] J. Schröder, B. Cauchi, M. R. Schädler, N. Moritz, K. Adiloglu, J. Anemüller, and S. Goetze, "Acoustic event detection using signal enhancement and spectro-temporal feature extraction," in Proc. WASPAA, 2013.
[2] T. Heittola, A. Mesaros, A. Eronen, and T. Virtanen, "Context-dependent sound event detection," EURASIP Journal on Audio, Speech, and Music Processing, Vol. 2013, No. 1, 2013, pp. 1–13.
[3] T. Heittola, A. Mesaros, T. Virtanen, and A. Eronen, "Sound event detection in multisource environments using source separation," in Workshop on Machine Listening in Multisource Environments, 2011, pp. 36–40.
[4] S. Innami and H. Kasai, "NMF-based environmental sound source separation using time-variant gain features," Computers & Mathematics with Applications, Vol. 64, No. 5, 2012, pp. 1333–1342.
[5] A. Dessein, A. Cont, and G. Lemaitre, "Real-time detection of overlapping sound events with non-negative matrix factorization," Springer Matrix Information Geometry, 2013, pp. 341–371.
[6] A. Mesaros, T. Heittola, and T. Virtanen, "TUT database for acoustic scene classification and sound event detection," in Proc. EUSIPCO, 2016.
[7] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, "Polyphonic sound event detection using multi label deep neural networks," in Proc. IEEE IJCNN, 2015, pp. 1–7.
[8] G. Parascandolo, H. Huttunen, and T. Virtanen, "Recurrent neural networks for polyphonic sound event detection in real life recordings," in Proc. IEEE ICASSP, 2016, pp. 6440–6444.
[9] T. Hayashi, M. Nishida, N. Kitaoka, and K. Takeda, "Daily activity recognition based on DNN using environmental sound and acceleration signals," in Proc. EUSIPCO, 2015, pp. 2306–2310.
[10] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Exploiting spectro-temporal locality in deep learning based acoustic event detection," EURASIP Journal on Audio, Speech, and Music Processing, Vol. 1, No. 1, 2015.
[11] Y. Wang, L. Neves, and F. Metze, "Audio-based multimedia event detection using deep recurrent neural networks," in Proc. IEEE ICASSP, 2016, pp. 2742–2746.
[12] F. Eyben, S. Böck, B. Schuller, and A. Graves, "Universal onset detection with bidirectional long short-term memory neural networks," in Proc. ISMIR, 2010, pp. 589–594.
[13] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. IEEE ICASSP, 2013, pp. 6645–6649.
[14] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," ArXiv e-prints arXiv:1402.1128, 2014.
[15] H. Sak et al., "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in Proc. INTERSPEECH, 2014.
[16] Z. Chen, S. Watanabe, H. Erdogan, and J. Hershey, "Integration of speech enhancement and recognition using long short-term memory recurrent neural network," in Proc. INTERSPEECH, 2015, pp. 3274–3278.
[17] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, Vol. 45, No. 11, 1997, pp. 2673–2681.
[18] http://www.cs.tut.fi/sgn/arg/dcase2016/
[19] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, Vol. 5, No. 2, 1994, pp. 157–166.
[20] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, Vol. 9, No. 8, 1997, pp. 1735–1780.
[21] F. A. Gers, J. Schmidhuber, and F. Cummins, "Learning to forget: Continual prediction with LSTM," Artificial Neural Networks, Vol. 12, No. 10, 1999, pp. 2451–2471.
[22] http://htk.eng.cam.ac.uk
[23] https://www.tensorflow.org
[24] A. Mesaros, T. Heittola, and T. Virtanen, "Metrics for polyphonic sound event detection," Applied Sciences, Vol. 6, No. 6, 2016, p. 162.