CTC: Connectionist Temporal Classification
Andrew Maas
Stanford University
Spring 2017
Ŵ = argmax_{W ∈ L} P(O | W) P(W)
(P(O | W): likelihood; P(W): prior)
!"#$%&'()!!*+
!"#$%&'!&()&(%&
-.'/#!0%'1&'
)2&'.""3'".'4"5&666 !"#$%&$'!('!)'
!"#$!"%
75&$8'2+998'.+/048 !"#$%&*
*#&!!'+)'!"#$%&, -('+'2"4&'0(')2&'*$"#(3 !"#$%&+
&&&
-.'/#!0%'1&' -.'/#!0%'1&')2&'.""3'".'4"5& !"#$%&,
)2&'.""3'".'4"5&666
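As a tiny illustration of the equation above, here is a hedged sketch of picking the best candidate transcript by combining acoustic and language-model log-scores; the candidate list and the scores are placeholders, not output from any real recognizer.

```python
import math

def decode(candidates, acoustic_log_likelihood, lm_log_prob):
    """W_hat = argmax over candidate W in L of  log P(O | W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_log_likelihood[w] + lm_log_prob[w])

# toy usage: made-up log-scores for three candidate sentences
acoustic = {"every happy family": -42.0,
            "in a hole in the ground": -37.5,
            "if music be the food of love": -31.2}
prior = {w: math.log(1 / 3) for w in acoustic}        # uniform language-model prior
print(decode(acoustic.keys(), acoustic, prior))       # -> "if music be the food of love"
```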
Hidden Markov Model (HMM): sub-phone states 942, 942, 6
Acoustic Model: one GMM per state models P(x|s)
x: input features
s: HMM state
Audio Input: acoustic feature frames
[Figure: Gaussian mixture densities over the acoustic features for each HMM state]
Hidden Markov Model (HMM): sub-phone states 942, 942, 6
[Figure: error rate (y-axis roughly 20–35) versus acoustic model size, comparing a GMM baseline with DNN acoustic models of 36M, 100M, 200M, and 400M parameters]
(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017) Stanford CS224S Spring 2017
Recurrent DNN Hybrid Acoustic Models
Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …
Hidden Markov Model (HMM): sub-phone states 942, 942, 6
Acoustic Model: deep neural network (Input → Hidden Layer → Hidden Layer → sub-phone state posteriors)
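A minimal sketch of what such a hybrid acoustic model might look like, assuming spliced acoustic feature frames as input and a senone (sub-phone state) inventory as output; the layer and inventory sizes are illustrative, not the ones from the lecture.

```python
import torch
import torch.nn as nn

# hypothetical sizes: 40-dim features with +/-5 frames of context, 8,000 senones
NUM_FEATURES, CONTEXT, NUM_SENONES = 40, 11, 8000

class HybridDNN(nn.Module):
    """Feed-forward acoustic model: per-frame posteriors P(s | x)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_FEATURES * CONTEXT, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, NUM_SENONES),
        )

    def forward(self, x):
        return self.net(x).log_softmax(dim=-1)  # log P(s | x) per frame

frames = torch.randn(16, NUM_FEATURES * CONTEXT)  # a toy batch of frames
log_posteriors = HybridDNN()(frames)              # (16, NUM_SENONES)
```

In the hybrid setup these posteriors are typically divided by state priors to obtain scaled likelihoods for the HMM.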
Hidden Markov Model (HMM): sub-phone states 942, 942, 6
Characters: SAMSON
Collapsing function: SS___AA_M_S___O___NNNN → SAMSON
Acoustic Model: per-frame character outputs S, S, _ with probabilities P(a|x1), P(a|x2), P(a|x3); use a DNN to approximate P(a|x)
Audio Input: Features (x1), Features (x2), Features (x3)
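A minimal sketch of the collapsing function shown above: merge repeated characters, then drop the blank symbol; the function name and the choice of '_' as the blank are just illustrative.

```python
from itertools import groupby

BLANK = "_"

def collapse(path: str) -> str:
    """CTC collapsing: merge adjacent repeats, then remove blanks."""
    merged = (symbol for symbol, _ in groupby(path))  # collapse repeated symbols
    return "".join(s for s in merged if s != BLANK)   # drop blanks

assert collapse("SS___AA_M_S___O___NNNN") == "SAMSON"
```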
Hypothesis: THIS PARCLE GUNA COME BACK ON THIS ILAND SOM DAY SOO
Reference: THE SPARKLE GONNA COME BACK ON THIS ISLAND SOMEDAY SOON

Hypothesis: TRADE REPRESENTIGD JUIDER WARANTS THAT THE U S WONT BACKCOFF ITS PUSH FOR TRADE BARIOR REDUCTIONS
Reference: TRADE REPRESENTATIVE YEUTTER WARNS THAT THE U S WONT BACK OFF ITS PUSH FOR TRADE BARRIER REDUCTIONS
(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
Decoding with a Language Model
Character probabilities from the network (e.g. __oo_h__y_e_aa_h) are decoded against a lexicon [a, …, zebra] and a word-level language model.
[Figure: character error rate (y-axis roughly 0–12) and word error rate (y-axis roughly 0–40) for decoding with no language model, with a lexicon constraint, and with a bigram language model]
(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
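As a sketch of how the language model enters the decoding score in this line of work, a hypothesis can be ranked by its CTC log-probability plus a weighted LM log-probability plus a word-insertion bonus; the weights below are placeholders to be tuned on held-out data.

```python
def hypothesis_score(log_p_ctc: float, log_p_lm: float, num_words: int,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Combined decoding score: acoustic (CTC) + weighted LM + word-insertion bonus."""
    return log_p_ctc + alpha * log_p_lm + beta * num_words
```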
Loss functions and architecture
What function to fit (the loss function) vs. how we approximate that function (the neural network architecture):
- HMM-DNN: independent per-frame classification against forced-alignment hard labels; typically fine with just a feed-forward DNN.
- CTC: still independent per frame, but cleverly allows for multiple possible labelings; needs a recurrent NN.
(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
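For contrast between the two losses above, here is a minimal sketch using PyTorch's built-in torch.nn.CTCLoss next to a per-frame negative log-likelihood loss; all sizes and the random inputs are illustrative only.

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 28, 10   # frames, batch, characters (blank at index 0), target length

# per-frame log-probabilities over characters, e.g. the output of a recurrent network
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)

# HMM-DNN style: independent per-frame classification against forced-alignment labels
frame_labels = torch.randint(low=1, high=C, size=(T, N))
frame_loss = nn.NLLLoss()(log_probs.view(T * N, C), frame_labels.view(T * N))

# CTC style: only the transcript is given; the loss sums over all valid alignments
targets = torch.randint(low=1, high=C, size=(N, S))        # character ids, blank excluded
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)

print(frame_loss.item(), ctc_loss.item())
```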
Recurrence Matters!
[Figure: per-frame character outputs S, S, _ with probabilities P(a|x1), P(a|x2), P(a|x3) computed from Features (x1), (x2), (x3)]

Architecture                    CER
DNN                             22
+ recurrence                    13
+ bi-directional recurrence     10
(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
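A minimal sketch of the bi-directional recurrent variant from the table above, assuming 40-dim feature frames and a small character inventory; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class BiRNNAcousticModel(nn.Module):
    """Bi-directional recurrent acoustic model producing per-frame log P(a | x)."""
    def __init__(self, num_features=40, hidden=256, num_chars=29):
        super().__init__()
        self.rnn = nn.LSTM(num_features, hidden, num_layers=3,
                           bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, num_chars)   # 2x for the two directions

    def forward(self, x):                             # x: (batch, time, num_features)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)        # (batch, time, num_chars)

frames = torch.randn(2, 100, 40)                      # toy batch: 2 utterances, 100 frames
log_char_probs = BiRNNAcousticModel()(frames)
```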
CTC Loss Function
Maximum log-likelihood training of the transcript
Intuition: the alignment is unknown, so integrate over all possible time-character alignments
Example: W = “hi”, T = 3
possible C such that K(C) = W:
hhi, hii, _hi, h_i, hi_
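A tiny brute-force sketch that enumerates all length-T character paths C with K(C) = W, reproducing the five paths listed above; for brevity the alphabet is restricted to the characters of W plus the blank (paths using other characters cannot collapse to W anyway).

```python
from itertools import groupby, product

BLANK = "_"

def collapse(path):
    merged = (s for s, _ in groupby(path))            # K(.): merge repeats, drop blanks
    return "".join(s for s in merged if s != BLANK)

def valid_paths(word, T, alphabet):
    return ["".join(p) for p in product(alphabet, repeat=T)
            if collapse("".join(p)) == word]

print(valid_paths("hi", 3, "hi" + BLANK))   # ['hhi', 'hii', 'hi_', 'h_i', '_hi'] (order may vary)
```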
After collapsing:
yet a rehbilitation cru is onhand in the building loogging bricks plaster and blueprins four forty two new betin epartments
Reference:
yet a rehabilitation crew is on hand in the building lugging bricks plaster and blueprints for forty two new bedroom
apartments
(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
Rethinking Decoding
Out-of-vocabulary words: syriza, bae, abo--, sof--, schmidhuber
[Figure: two decoding pipelines over the same character probabilities (__oo_h__y_e_aa_h): one constrained to a fixed lexicon [a, …, zebra], and one that decodes directly over characters and so can produce out-of-vocabulary words]
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Beam Search Decoding
[Figure: word error rate (y-axis roughly 0–35) for HMM-GMM, CTC with no LM, CTC + 7-gram LM, CTC + NN LM, and HMM-DNN]
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
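A sketch of CTC prefix beam search with an optional character-level LM, in the spirit of the decoding compared above; this is a generic textbook-style implementation, not the authors' code, and the lm callable, alphabet, and weights are assumptions.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*args):
    a = max(args)
    if a == NEG_INF:
        return NEG_INF
    return a + math.log(sum(math.exp(x - a) for x in args))

def prefix_beam_search(log_probs, alphabet, blank=0, beam_width=8, lm=None, alpha=1.0):
    """log_probs: T x V per-frame log character probabilities.
    lm: optional callable lm(prefix, symbol) -> log P(symbol | prefix)."""
    # each prefix keeps (log prob ending in blank, log prob ending in non-blank)
    beam = {(): (0.0, NEG_INF)}
    for t in range(len(log_probs)):
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam.items():
            for s, p in enumerate(log_probs[t]):
                if s == blank:
                    nb_b, nb_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(nb_b, p_b + p, p_nb + p), nb_nb)
                    continue
                new_prefix = prefix + (s,)
                lm_score = alpha * lm(prefix, s) if lm else 0.0
                nb_b, nb_nb = next_beam[new_prefix]
                if prefix and s == prefix[-1]:
                    # repeated symbol: extending the prefix needs an intervening blank
                    next_beam[new_prefix] = (nb_b, logsumexp(nb_nb, p_b + p + lm_score))
                    ob_b, ob_nb = next_beam[prefix]
                    next_beam[prefix] = (ob_b, logsumexp(ob_nb, p_nb + p))
                else:
                    next_beam[new_prefix] = (
                        nb_b, logsumexp(nb_nb, p_b + p + lm_score, p_nb + p + lm_score))
        beam = dict(sorted(next_beam.items(),
                           key=lambda kv: logsumexp(*kv[1]),
                           reverse=True)[:beam_width])
    best_prefix, _ = max(beam.items(), key=lambda kv: logsumexp(*kv[1]))
    return "".join(alphabet[s] for s in best_prefix)

# toy usage: 3 frames over the alphabet [blank, 'h', 'i'], no LM
alphabet = ["_", "h", "i"]
probs = [[math.log(p) for p in row] for row in
         [[0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]]]
print(prefix_beam_search(probs, alphabet))   # expected: "hi"
```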
Example Results (Switchboard): ~19% CER
Hypothesis: i i don'tknow i don't know what the rain force have to do with it but you know their chop a those down af the tr minusrat everyday
Reference: i- i don't kn- i don't know what the rain forests have to do with it but you know they're chopping those down at a tremendous rate everyday

Hypothesis: i guess down't here u we just recently move to texas so my wor op has change quite a bit muh we ook from colorado were and i have a cloveful of sweatterso tuth
Reference: i guess down here uh we just recently moved to texas so my wardrobe has changed quite a bit um we moved from colorado where and i have a closet full of sweaters that

Hypothesis: i don't know whether state lit state hood whold itprove there a conomy i don't i don't know that to that the actove being a state
Reference: i don't know whether state woul- statehood would improve their economy i don't i don't know that the ve- the act of being a state
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing CLMs
Switchboard Word Error Rate
[Figure: word error rate (y-axis roughly 0–40) for decoding with no LM and with 5-gram, 7-gram, NN 1H, NN 3H, RNN 1H, and RNN 3H character-level language models]
All NN models have 5M total parameters
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
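A minimal sketch of what one of the recurrent character LMs being compared might look like; the vocabulary, embedding, hidden size, and layer count here are illustrative, not the 5M-parameter configurations from the slide.

```python
import torch
import torch.nn as nn

class CharRNNLM(nn.Module):
    """Recurrent character-level language model: log P(next char | history)."""
    def __init__(self, vocab_size=30, embed=64, hidden=512, layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed)
        self.rnn = nn.LSTM(embed, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, chars, state=None):          # chars: (batch, length) of char ids
        h, state = self.rnn(self.embed(chars), state)
        return self.out(h).log_softmax(dim=-1), state

# score a hypothesis one character at a time, as a decoder would
lm = CharRNNLM()
history = torch.tensor([[2, 5, 9]])                # toy character ids
log_probs, _ = lm(history)                         # (1, 3, vocab) next-char log-probs
```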
Transcribing Out of Vocabulary Words
Truth: yeah i went into the i do not know what you think of fidelity but
HMM-GMM: yeah when the i don’t know what you think of fidel it even them
CTC-CLM: yeah i went to i don’t know what you think of fidelity but um
Truth: i would ima- well yeah it is i know you are able to stay home with them
HMM-GMM: i would amount well yeah it is i know um you’re able to stay home with them
CTC-CLM: i would ima- well yeah it is i know uh you’re able to stay home with them
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing Alignments
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Scaling end-to-end models: Baidu Deep Speech

Training data:
Language    Hours
English     12,000
Mandarin    10,000

[Figure: data augmentation pipeline: normalized Speech plus normalized Noise yields Noisy Speech]
[Figure: data-parallel training: Model 1 through Model 4 share weight updates each iteration over InfiniBand]
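A minimal sketch of the noise-augmentation idea above (mix a noise clip into a speech clip at a target signal-to-noise ratio); the waveform lengths and the SNR value are illustrative assumptions.

```python
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix noise into speech at the requested SNR (both are 1-D float waveforms)."""
    noise = np.resize(noise, speech.shape)                  # loop/trim noise to match length
    speech_power = np.mean(speech ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # scale noise so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# toy usage with random waveforms standing in for real audio
rng = np.random.default_rng(0)
noisy = add_noise(rng.normal(size=16000), rng.normal(size=8000), snr_db=10.0)
```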
(Chan, Jaitly, Le, & Vinyals. 2015) Stanford CS224S Spring 2017
Listen, Attend, and Spell
[Figure: the Listen, Attend and Spell model: an encoder over the audio ("Listen") and an attention-based character decoder ("Attend and Spell")]
(Chan, Jaitly, Le, & Vinyals. 2015) Stanford CS224S Spring 2017
Attention-based sequence generation
Maximum likelihood conditional language model
given the audio
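A minimal sketch of one attention step in such an audio-conditioned language model (dot-product attention over encoder states, producing a context vector that conditions the next-character distribution); all dimensions and the helper name are illustrative.

```python
import torch
import torch.nn as nn

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight encoder frames by similarity to the decoder state.
    decoder_state:  (batch, d)
    encoder_states: (batch, time, d)
    returns the context vector (batch, d) and attention weights (batch, time)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = scores.softmax(dim=-1)
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights

# one decoding step: predict the next character from [decoder_state, context]
d, vocab = 256, 30
decoder_state = torch.randn(2, d)
encoder_states = torch.randn(2, 120, d)        # "listened" audio representation
context, weights = attend(decoder_state, encoder_states)
next_char_logits = nn.Linear(2 * d, vocab)(torch.cat([decoder_state, context], dim=-1))
```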