Detection and Classification of Acoustic Scenes and Events 2016 3 September 2016, Budapest, Hungary
Tomoki Hayashi1, Shinji Watanabe2, Tomoki Toda1, Takaaki Hori2, Jonathan Le Roux2, Kazuya Takeda1

1 Nagoya University, Furo-cho, Chikusa-ku, Nagoya, Aichi 464-8603, Japan
2 Mitsubishi Electric Research Laboratories (MERL), 201 Broadway, Cambridge, MA 02139, USA

hayashi.tomoki@g.sp.m.is.nagoya-u.ac.jp,
{takeda,tomoki}@is.nagoya-u.ac.jp, {watanabe,thori,leroux}@merl.com
A Bidirectional Recurrent Neural Network (BRNN) [13, 17] is a layered neural network which has feedback not only from the previous time period but also from the following time period. The structure of a BRNN is shown in Fig. 2. The hidden layer which connects to the following time period is called the forward layer, while the layer which connects to the previous time period is called the backward layer. Compared with conventional RNNs, BRNNs can propagate information not only from the past but also from the future, and therefore have the ability to understand and exploit the full context of an input sequence.

2.4. Projection Layer

Use of a projection layer is a technique which reduces the computational complexity of deep recurrent network structures and allows the creation of very deep LSTM networks [14, 15]. The architecture of an LSTM-RNN with a projection layer (LSTMP-RNN) is shown in Fig. 4. The projection layer, which is a linear transformation layer, is inserted after an LSTM layer, and the projection layer outputs feed back to the LSTM layer. With the insertion of a projection layer, the hidden layer output $h_{t-1}$ in Eqs. 3-6 is replaced with $p_{t-1}$, and the following equation is added:

$$p_t = W_p h_t,$$

where $W_p$ is the projection weight matrix.
Figure 2: Bidirectional Recurrent Neural Network

Figure 4: Long Short-Term Memory Recurrent Neural Network with Projection Layer
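To make these two ideas concrete, the following minimal PyTorch sketch (our illustration, not the authors' implementation) stacks bidirectional LSTM layers with projection layers using the proj_size option of torch.nn.LSTM; all layer sizes are assumptions for the example, not values from the paper.

```python
# Sketch only: a bidirectional LSTM with projection layers (BLSTMP).
# proj_size inserts the linear map p_t = W_p h_t after each LSTM layer,
# and the recurrence feeds back p_{t-1} in place of h_{t-1}.
import torch
import torch.nn as nn

blstmp = nn.LSTM(
    input_size=100,       # assumed feature dimension
    hidden_size=512,      # assumed LSTM cell size
    num_layers=2,
    batch_first=True,
    bidirectional=True,   # forward layer + backward layer (Fig. 2)
    proj_size=128,        # projection layer size (Fig. 4)
)

x = torch.randn(4, 50, 100)   # (batch, frames, features)
out, _ = blstmp(x)
print(out.shape)              # torch.Size([4, 50, 256]) = 2 * proj_size
```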
3.3. Model

We extended the hybrid HMM/neural network model in order to handle the multi-label classification problem. To do this, we built a three-state left-to-right HMM with a non-active state for each sound event. The structure of our HMM is shown in Fig. 5, where n = 0, n = 5, and n = 4 represent the initial state, final state, and non-active state, respectively. Notice that the non-active state represents not only the case where no event is active, but also the case where other events are active; therefore, the non-active state of each sound event HMM has a meaning different from silence. In this study, we fix all transition probabilities to a constant value of 0.5.
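As an illustration, the Python sketch below builds one plausible transition matrix for this per-event HMM. The state roles (n = 0 initial, n = 5 final, n = 4 non-active) and the fixed 0.5 probability come from the text above; the exact arc set is our assumption, since Fig. 5 is not reproduced here.

```python
# Sketch only: a plausible transition matrix for the per-event HMM.
# States: 0 = initial, 1-3 = left-to-right active states,
#         4 = non-active state, 5 = final (absorbing, row left at zero).
import numpy as np

N = 6
A = np.zeros((N, N))
A[0, 1] = A[0, 4] = 0.5      # from initial: enter active path or non-active state
for n in (1, 2):             # active path: self-loop or move right
    A[n, n] = A[n, n + 1] = 0.5
A[3, 3] = A[3, 5] = 0.5      # last active state: self-loop or exit
A[4, 4] = A[4, 5] = 0.5      # non-active state: self-loop or exit

assert np.allclose(A[:5].sum(axis=1), 1.0)  # each emitting row is a distribution
```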
Using Bayes' theorem, the HMM state emission probability $P(x_t \mid s_{c,t} = n)$ can be approximated as follows:

$$P(x_t \mid s_{c,t} = n) = \frac{P(s_{c,t} = n \mid x_t)\,P(x_t)}{P(s_{c,t} = n)} \simeq \frac{P(s_{c,t} = n \mid x_t)}{P(s_{c,t} = n)}. \qquad (9)$$

The state posterior $P(s_{c,t} = n \mid x_t)$ is obtained from the network output with the softmax function:

$$P(s_{c,t} = n \mid x_t) = \frac{\exp(a_{c,n,t})}{\sum_{n'=1}^{N} \exp(a_{c,n',t})}, \qquad (10)$$

where $a_{c,n,t}$ represents the activation of the output layer node for state $n$ of sound event $c$ at frame $t$. The network was optimized using back-propagation through time (BPTT) with stochastic gradient descent (SGD) and dropout under the cross-entropy objective function for the multi-class, multi-label problem:

$$E(\Theta) = -\sum_{c=1}^{C} \sum_{n=1}^{N} \sum_{t=1}^{T} y_{c,n,t} \ln\big(P(s_{c,t} = n \mid x_t)\big), \qquad (11)$$

where $\Theta$ represents the set of network parameters, and $y_{c,n,t}$ is the HMM state label obtained from the maximum likelihood path at frame $t$. (Note that this is not the same as the multi-class objective function in a conventional DNN-HMM.) The HMM state prior $P(s_{c,t})$ is calculated by counting the occurrences of each HMM state. However, in this study, because our synthetic training data does not represent actual sound event occurrences, the prior obtained from these counts has to be made less sensitive. Therefore, we smoothed $P(s_{c,t})$ as follows:

$$\hat{P}(s_{c,t}) = P(s_{c,t})^{\alpha}, \qquad (12)$$

where $\alpha$ is a smoothing coefficient. In this study, we set $\alpha = 0.01$.
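As a minimal sketch of how Eqs. 9 and 12 combine at decoding time (our illustration; array shapes and names are assumptions, not part of the paper), the network's frame-level posteriors for one sound event are converted into scaled log emission probabilities using the smoothed prior:

```python
# Sketch only: scaled log emission probabilities from posteriors (Eqs. 9, 12).
import numpy as np

def scaled_log_emissions(posteriors, state_counts, alpha=0.01, eps=1e-10):
    """posteriors: (T, N) network outputs P(s_t = n | x_t) for one event;
    state_counts: (N,) occurrence counts of each HMM state in training."""
    prior = state_counts / state_counts.sum()  # P(s_t = n)
    prior_hat = prior ** alpha                 # Eq. 12: smoothed prior
    # Eq. 9: P(x_t | s_t = n) is proportional to P(s_t = n | x_t) / P_hat(s_t = n)
    return np.log(posteriors + eps) - np.log(prior_hat + eps)
```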
Finally, we calculated the HMM state emission probability using Eq. 9 and obtained the maximum likelihood path using the Viterbi algorithm.
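A standard Viterbi pass (again our sketch, not the authors' code, assuming a uniform initial state distribution) recovers that maximum likelihood path from the log emissions and the transition matrix A sketched earlier:

```python
# Sketch only: Viterbi decoding over one event's HMM.
import numpy as np

def viterbi(log_B, A, eps=1e-10):
    """log_B: (T, N) log emission scores; A: (N, N) transition matrix."""
    T, N = log_B.shape
    log_A = np.log(A + eps)
    delta = np.empty((T, N))             # best log score ending in state n
    psi = np.zeros((T, N), dtype=int)    # backpointers
    delta[0] = log_B[0]                  # uniform initial distribution assumed
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (from_state, to_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):       # backtrace
        path[t] = psi[t + 1, path[t + 1]]
    return path
```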
4. EXPERIMENTS