
CS 224S / LINGUIST 285

Spoken Language Processing

Andrew Maas
Stanford University
Spring 2017

Lecture 8: End-to-end neural network speech recognition
Outline
— ASR discussion thus far
— Connectionist temporal classification (CTC)
— Lexicon-free CTC
— Scaling up end-to-end neural approaches
— Alternative end-to-end approaches
— HW3 discussion



Noisy channel model

$$\hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \underbrace{P(O \mid W)}_{\text{likelihood}} \, \underbrace{P(W)}_{\text{prior}}$$


The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source).
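Why the denominator can be ignored: by Bayes' rule the posterior over sources shares the same denominator P(Signal) for every candidate, so the argmax is unchanged:

$$\hat{W} = \operatorname*{argmax}_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \operatorname*{argmax}_{W} P(O \mid W)\,P(W)$$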

[Figure: noisy channel decoding. Candidate source sentences (e.g. "If music be the food of love...", "Every happy family", "In a hole in the ground") are scored against the observed noisy signal, and the decoder guesses the most likely source.]


HMM for the digit recognition task

[Figure: HMM topology for the digit recognition task]


Acoustic Modeling with GMMs

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, … (one state per frame)
Acoustic Model: per-state GMMs model P(x|s), where x is the input feature vector and s the HMM state
Audio Input: feature vectors x1, x2, x3, … (one per frame)

[Figure: GMM densities over the feature space for three HMM states]
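To make the GMM acoustic score concrete, here is a minimal numpy sketch of log P(x|s) for one state's diagonal-covariance mixture (function name and shapes are illustrative, not from the lecture):

import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | s) under one HMM state's diagonal-covariance GMM.
    x: (D,) feature vector; weights: (K,) mixture weights summing to 1;
    means, variances: (K, D) per-component parameters."""
    log_components = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    # log-sum-exp over the K mixture components
    return np.logaddexp.reduce(log_components)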


DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: use a DNN to approximate P(s|x), producing P(s|x1), P(s|x2), P(s|x3), …; then apply Bayes' rule to recover the likelihood the HMM needs:

P(x|s) = P(s|x) * P(x) / P(s)
       =  DNN   * constant / state prior

Audio Input: Features (x1) Features (x2) Features (x3)
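In code, the hybrid trick is a single subtraction in the log domain; a minimal numpy sketch (names are illustrative):

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_log_priors):
    """Convert DNN outputs into HMM observation scores.
    log_posteriors: (T, S) frame-wise log P(s | x_t) from the network.
    state_log_priors: (S,) log P(s), estimated from alignment counts.
    Returns (T, S) log P(x_t | s) up to a per-frame constant log P(x_t),
    which Viterbi decoding can ignore."""
    return log_posteriors - state_log_priors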


Framework + Isolated Training Limitations

[Figure: frame error rate and RT-03 word error rate versus acoustic model size (HMM-GMM baseline, then DNNs from 36M up to 400M parameters)]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017) Stanford CS224S Spring 2017
Recurrent DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: a recurrent DNN estimates P(s|x1), P(s|x2), P(s|x3), …, carrying hidden state across frames
Audio Input: Features (x1) Features (x2) Features (x3)


Deep Recurrent Network

[Figure: Input → Hidden Layer → Hidden Layer → Output Layer, with recurrent connections in the hidden layers]
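A minimal PyTorch sketch of such a stacked recurrent model; the layer sizes and label inventory are illustrative, not the lecture's configuration:

import torch
import torch.nn as nn

class DeepRNN(nn.Module):
    """Stacked recurrent network: per-frame features in, per-frame label
    log-probabilities out (usable for the hybrid or CTC setups here)."""
    def __init__(self, n_features=40, n_hidden=256, n_labels=29):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_labels)  # 2x: both directions

    def forward(self, x):                    # x: (batch, time, n_features)
        h, _ = self.rnn(x)                   # (batch, time, 2 * n_hidden)
        return self.out(h).log_softmax(-1)   # per-frame log P(label | x)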


HMM-Free Recognition

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: P(s|x1), P(s|x2), P(s|x3)
Audio Input: Features (x1) Features (x2) Features (x3)

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017


HMM-Free Recognition

Transcription: Samson
Characters: SAMSON
Collapsing function: SS___AA_M_S___O___NNNN → SAMSON

Acoustic Model: use a DNN to approximate P(a|x), the distribution over characters, producing per-frame outputs P(a|x1), P(a|x2), P(a|x3), … (e.g. S, S, _)
Audio Input: Features (x1) Features (x2) Features (x3)

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017
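The collapsing function itself is tiny; a sketch with the blank written as "_":

import itertools

def collapse(path, blank="_"):
    """CTC collapsing: merge runs of repeated symbols, then drop blanks.
    collapse("SS___AA_M_S___O___NNNN") -> "SAMSON"."""
    return "".join(ch for ch, _ in itertools.groupby(path) if ch != blank)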


Example Results (WSJ)
Model output (Hyp) vs. reference (Ref):

Hyp: YET A REHBILITATION CRU IS ONHAND IN THE BUILDING LOOGGING BRICKS PLASTER AND BLUEPRINS FOUR FORTY TWO NEW BETIN EPARTMENTS
Ref: YET A REHABILITATION CREW IS ON HAND IN THE BUILDING LUGGING BRICKS PLASTER AND BLUEPRINTS FOR FORTY TWO NEW BEDROOM APARTMENTS

Hyp: THIS PARCLE GUNA COME BACK ON THIS ILAND SOM DAY SOO
Ref: THE SPARKLE GONNA COME BACK ON THIS ISLAND SOMEDAY SOON

Hyp: TRADE REPRESENTIGD JUIDER WARANTS THAT THE U S WONT BACKCOFF ITS PUSH FOR TRADE BARIOR REDUCTIONS
Ref: TRADE REPRESENTATIVE YEUTTER WARNS THAT THE U S WONT BACK OFF ITS PUSH FOR TRADE BARRIER REDUCTIONS

Hyp: TREASURY SECRETARY BAGER AT ROHIE WOS IN AUGGRAL PRESSED FOUR ARISE IN THE VALUE OF KOREAS CURRENCY
Ref: TREASURY SECRETARY BAKER AT ROH TAE WOOS INAUGURAL PRESSED FOR A RISE IN THE VALUE OF KOREAS CURRENCY


Earlier work on CTC with phonemes

(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
Decoding with a Language Model

Pipeline: character probabilities (e.g. __oo_h__y_e_aa_h) are decoded against a lexicon [a, …, zebra] and a word language model p("yeah" | "oh").

[Figure: bar charts of character error rate and word error rate when decoding with no LM, with a lexicon, and with a bigram LM]

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
Loss functions and architecture

— What function to fit (the loss function):
— HMM-DNN uses independent per-frame classification with forced-alignment hard labels
— CTC is also independent per-frame, but cleverly allows for multiple possible labelings

— How do we approximate that function (the network architecture):
— HMM-DNN is typically fine with just a DNN
— CTC needs a recurrent NN


CTC loss during training

(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
Recurrence Matters!

Per-frame character outputs (S S _) from P(a|x1), P(a|x2), P(a|x3) over Features (x1), (x2), (x3):

Architecture                   CER
DNN                            22
+ recurrence                   13
+ bi-directional recurrence    10

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
CTC Loss Function

— Maximum log-likelihood training of the transcript
— Intuition: the alignment is unknown, so integrate over all possible time-character alignments
— Example: W = "hi", T = 3. The possible C such that κ(C) = W are: hhi, hii, _hi, h_i, hi_

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017
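The sum over alignments is computed with a dynamic program, the CTC forward algorithm (Graves et al. 2006). A numpy sketch with illustrative names; for W = "hi" and T = 3 it adds up exactly the five paths listed above:

import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log P(target | input), summing over all
    time-level alignments (a sketch; assumes a non-empty target).
    log_probs: (T, V) per-frame log label probabilities.
    target: label indices without blanks, e.g. the indices of "h", "i"."""
    ext = [blank]                      # interleave blanks: hi -> _h_i_
    for label in target:
        ext += [label, blank]
    T, S = len(log_probs), len(ext)
    alpha = np.full((T, S), -np.inf)   # alpha[t, s]: log-prob of prefixes
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                            # stay
            if s > 0:
                a = np.logaddexp(a, alpha[t - 1, s - 1])   # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = np.logaddexp(a, alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = a + log_probs[t, ext[s]]
    # Valid paths end on the last label or the trailing blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])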


CTC Objective Function

— Labels at each time index are conditionally independent given the input (like HMMs)
— Sum over all time-level labelings consistent with the output label
— Example: output label AB; time-level labelings AB, _AB, A_B, …, _A_B_
— Final objective maximizes the probability of the true labels:
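In standard notation (after Graves & Jaitly 2014), with κ(·) the collapsing function and per-frame outputs conditionally independent given the input X:

$$P(W \mid X) = \sum_{C \,:\, \kappa(C) = W} \; \prod_{t=1}^{T} P(c_t \mid X), \qquad \theta^{*} = \operatorname*{argmax}_{\theta} \sum_{(X,\, W^{*})} \log P(W^{*} \mid X; \theta)$$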

(Graves & Jaitly, ICML 2014) Stanford CS224S Spring 2017


Collapsing Example
Per-frame argmax:
____________________________________________________________________________________________________
yy__ee_________tt_ ____________________________________________a_____
_rr__e________hh__________b___ii_______lll__i_____tt______aa______tt_______iio__n___
___cc_____rrr_u_____________________ ________ii___ss
______________o__________nn_____________hhh_a___________________nnddd ________________i__n___
__thh_e_____ __________________________________________bb_uuii_______lllldd____ii____nng_____
___________________________________l___o___o_g__g___ii____nng______
____b___rr_ii________ck__s__________________________________________p___ll__a________sstt_________eerr__
______a___nnd_ ___b___lll_uu____ee__pp___r___i________nnss_
________________f______oou____________rrr________ _____________f_____oo__rrr__tt_y____
_____t____www_oo__________ ____nn___ew___________________
______________________________________________________b___e_______t__________i____n___
____e________pp_____aa___rr___tt____mm_ee___nnntss
____________________________________________________________________________________________________
_________________________________

After collapsing:
yet a rehbilitation cru is onhand in the building loogging bricks plaster and blueprins four forty two new betin epartments

Reference:
yet a rehabilitation crew is on hand in the building lugging bricks plaster and blueprints for forty two new bedroom
apartments

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
Rethinking Decoding

Out-of-vocabulary and partial words break lexicon-based decoding: syriza, bae, abo--, sof--, schmidhuber.

Word-LM pipeline: character probabilities (__oo_h__y_e_aa_h) → lexicon [a, …, zebra] → word language model p("yeah" | "oh")
Character-LM pipeline: character probabilities (__oo_h__y_e_aa_h) → character language model p(h | o,h,␣,y,e,a), with no lexicon at all

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
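As a stand-in for the character LMs used here (the paper's are n-gram and neural models), a tiny add-one-smoothed character n-gram sketch; everything below is illustrative, and its prob method matches the lm callable in the beam-search sketch that follows:

from collections import defaultdict

class CharNGramLM:
    """Character n-gram LM with add-one smoothing (illustrative only)."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(int)    # (context, char) -> count
        self.context = defaultdict(int)   # context -> count

    def train(self, text):
        padded = " " * (self.n - 1) + text
        for i in range(len(text)):
            ctx, ch = padded[i:i + self.n - 1], padded[i + self.n - 1]
            self.counts[(ctx, ch)] += 1
            self.context[ctx] += 1

    def prob(self, prefix, ch, vocab_size=28):
        ctx = "".join(prefix)[-(self.n - 1):].rjust(self.n - 1)
        return (self.counts[(ctx, ch)] + 1) / (self.context[ctx] + vocab_size)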
Beam Search Decoding
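A sketch of CTC prefix beam search in the spirit of Hannun et al. (2014), with an optional character LM plugged in; names and defaults are illustrative. Each prefix tracks separate probabilities for ending in blank vs. in a character, so paths that collapse to the same prefix are summed exactly:

import numpy as np
from collections import defaultdict

def ctc_prefix_beam_search(probs, alphabet, blank=0, beam_width=8,
                           lm=None, alpha=1.0):
    """probs: (T, V) per-frame character probabilities (V includes blank).
    alphabet: length-V list of characters (the entry at `blank` is unused).
    lm: optional callable lm(prefix, char) -> p(char | prefix);
    alpha weights the LM against the acoustic evidence."""
    beams = {(): (1.0, 0.0)}  # prefix -> (ends-in-blank, ends-in-char) prob
    for t in range(len(probs)):
        nxt = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(probs[t]):
                if c == blank:  # blank keeps the prefix, marks a boundary
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b + p * (p_b + p_nb), nb)
                    continue
                ch = alphabet[c]
                lm_w = lm(prefix, ch) ** alpha if lm else 1.0
                ext = prefix + (ch,)
                if prefix and ch == prefix[-1]:
                    # A repeated char extends the prefix only across a blank...
                    b, nb = nxt[ext]
                    nxt[ext] = (b, nb + p * p_b * lm_w)
                    # ...otherwise it merges into the same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, nb + p * p_nb)
                else:
                    b, nb = nxt[ext]
                    nxt[ext] = (b, nb + p * (p_b + p_nb) * lm_w)
        beams = dict(sorted(nxt.items(), key=lambda kv: sum(kv[1]),
                            reverse=True)[:beam_width])
    best = max(beams, key=lambda k: sum(beams[k]))
    return "".join(best)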


Lexicon-Free & HMM-Free on Switchboard

[Figure: Switchboard word error rate for HMM-GMM, CTC with no LM, CTC + 7-gram character LM, CTC + NN character LM, and HMM-DNN]

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Example Results (Switchboard), ~19% CER
Model output (Hyp) vs. reference (Ref):

Hyp: i i don'tknow i don't know what the rain force have to do with it but you know their chop a those down af the tr minusrat everyday
Ref: i- i don't kn- i don't know what the rain forests have to do with it but you know they're chopping those down at a tremendous rate everyday

Hyp: come home and get back in to regular cloos aga
Ref: come home and get back into regular clothes again

Hyp: i guess down't here u we just recently move to texas so my wor op has change quite a bit muh we ook from colorado were and i have a cloveful of sweatterso tuth
Ref: i guess down here uh we just recently moved to texas so my wardrobe has changed quite a bit um we moved from colorado where and i have a closet full of sweaters that

Hyp: i don't know whether state lit state hood whold itprove there a conomy i don't i don't know that to that the actove being a state
Ref: i don't know whether state woul- statehood would improve their economy i don't i don't know that the ve- the act of being a state

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing CLMs

[Figure: Switchboard word error rate when decoding with no LM vs. 5-gram, 7-gram, NN 1-hidden-layer, NN 3-hidden-layer, RNN 1-hidden-layer, and RNN 3-hidden-layer character LMs]

All NN models have 5M total parameters.
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Transcribing Out of Vocabulary Words

Truth: yeah i went into the i do not know what you think of fidelity but
HMM-GMM: yeah when the i don’t know what you think of fidel it even them
CTC-CLM: yeah i went to i don’t know what you think of fidelity but um

Truth: no no speaking of weather do you carry a altimeter slash barometer
HMM-GMM: no i’m not all being the weather do you uh carry a uh helped emitters last brahms her
CTC-CLM: no no beating of whether do you uh carry a uh a time or less barometer

Truth: i would ima- well yeah it is i know you are able to stay home with them
HMM-GMM: i would amount well yeah it is i know um you’re able to stay home with them
CTC-CLM: i would ima- well yeah it is i know uh you’re able to stay home with them

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing Alignments

[Figure: HMM-GMM phone probabilities vs. CTC character probabilities over time]

(HMM slide from Dan Ellis) Stanford CS224S Spring 2017


Learning Phonemes and Timing

— Take all phone segments from the HMM-GMM alignments (e.g. the phone k)
— Align all segments to start at the same time t = 0
— Compute the average CTC character probabilities during the segment (c, e, k)
— The vertical line shows the median end time of the phone segment from the HMM-GMM alignments


Learning Phonemes and Timing

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Scaling end-to-end models: Baidu Deep Speech

(Hannun et al. 2014) Stanford CS224S Spring 2017


Deep Speech – Deep RNN

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Batch Norm for RNNs

[Figure: "Normalize" blocks inserted between the stacked recurrent layers]

Slides from Awni Hannun Stanford CS224S Spring 2017
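Deep Speech 2 normalizes the feed-forward (vertical) connections sequence-wise, i.e. over both the batch and time dimensions. A minimal sketch of that operation (training-mode statistics only, without the learned scale and shift):

import torch

def sequence_batch_norm(x, eps=1e-5):
    """Sequence-wise batch norm: normalize each feature over both the
    batch and time dimensions. x: (batch, time, features)."""
    mean = x.mean(dim=(0, 1), keepdim=True)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)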


Deep Speech – Batch Norm for RNNs

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Hours of speech data

Language Hours
English 12,000
Mandarin 10,000

Where does the data come from?


● Public benchmarks (English)

● Internal manually labelled data (English and Mandarin)

● Captioned videos (English and Mandarin)

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Captioned Video Data Pipeline

1. Download publicly available video + captions.
2. Align captions to video with a CTC model.
3. Segment at regions of silence.
4. Use a simple classifier to throw out very noisy samples.

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Captioned Video Data Pipeline
Align with a model trained with CTC?

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Even more data!

Augmentation: noise synthesis, reverb, time-stretching, pitch-shifting, ...

[Figure: Speech + Noise → Noisy Speech]

Slides from Awni Hannun Stanford CS224S Spring 2017
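A minimal numpy sketch of the noise-synthesis branch, mixing a noise clip into speech at a target signal-to-noise ratio (the function and its default are illustrative):

import numpy as np

def add_noise(speech, noise, snr_db=10.0):
    """Mix a noise clip into a speech clip at a target SNR in dB.
    speech, noise: 1-D float arrays of samples."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12   # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise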


Deep Speech – Data Parallel GPU Scaling

[Figure: four model replicas (Model 1-4) connected over InfiniBand, sharing weight updates each iteration]

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Data Parallel GPU Scaling

Custom ring all-reduce avoids extraneous copies to CPU memory.

# GPUs   OpenMPI all-reduce (s)*   Custom all-reduce (s)*   Factor speedup
4        55359                     2587                     21.4
8        48881                     2470                     19.8
16       21562                     1393                     15.5

*Measures time spent in all-reduce for a single epoch.

Slides from Awni Hannun Stanford CS224S Spring 2017
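To see why the ring pattern wins, here is a toy simulation of ring all-reduce (reduce-scatter, then all-gather): per-worker traffic stays proportional to the gradient size regardless of the number of workers. All names are illustrative:

import numpy as np

def ring_allreduce(worker_grads):
    """Toy simulation of ring all-reduce over n workers.
    Phase 1 (reduce-scatter): each worker ends up owning the full sum of
    one chunk. Phase 2 (all-gather): owned chunks travel around the ring
    until every worker holds every summed chunk."""
    n = len(worker_grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n))
              for g in worker_grads]
    for step in range(n - 1):  # reduce-scatter
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data  # neighbour accumulates
    for step in range(n - 1):  # all-gather
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data   # neighbour overwrites
    return [np.concatenate(ch) for ch in chunks]

# e.g. ring_allreduce([np.ones(8), 2 * np.ones(8)]) -> two arrays of 3.0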


Deep Speech – Data Parallel GPU Scaling

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Some results

Architecture                                         English (WER)   Mandarin (WER)
5-layer, 1 RNN                                       13.55           15.41
5-layer, 3 RNN                                       11.61           11.85
5-layer, 3 RNN + BatchNorm                           10.56           9.39
9-layer, 7 RNN + BatchNorm + frequency convolution   9.52            7.93

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Deployment

● Bi-directional models give almost a 10% relative boost … but we can’t deploy them.
● ASR latencies for voice search are <50 ms.
● For 3 seconds of audio, we would need to decode 60x faster than realtime!

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Lookahead convolution

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Lookahead convolution

For a lookahead of 20 time-steps (about 800 ms into the future):

Model                                English (WER)   Chinese (WER)
Forward only                         18.8            15.7
Forward + Lookahead (+50k params)    16.8            13.5
Bidirectional (+12M params)          15.4            12.8

Slides from Awni Hannun Stanford CS224S Spring 2017
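A PyTorch sketch of the idea: each feature channel takes a learned weighted sum over a small window of future frames, giving a forward-only model limited right context without full bidirectionality (sizes and initialization are illustrative):

import torch
import torch.nn as nn

class LookaheadConv(nn.Module):
    """Lookahead (row) convolution sketch: per-channel filters over
    future timesteps [t, t + context]."""
    def __init__(self, n_features, context=20):
        super().__init__()
        self.context = context
        self.weights = nn.Parameter(torch.randn(n_features, context + 1) * 0.1)

    def forward(self, x):                  # x: (batch, time, features)
        # Zero-pad the future edge so every step sees `context` frames ahead.
        pad = x.new_zeros(x.size(0), self.context, x.size(2))
        xp = torch.cat([x, pad], dim=1)    # (batch, time + context, features)
        out = torch.zeros_like(x)
        for tau in range(self.context + 1):
            out = out + xp[:, tau:tau + x.size(1), :] * self.weights[:, tau]
        return out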


Listen, Attend, and Spell

[Figure sequence: the LAS architecture, a pyramidal recurrent encoder ("listen") feeding an attention-based character decoder ("attend and spell")]

(Chan, Jaitly, Le, & Vinyals. 2015) Stanford CS224S Spring 2017
Attention-based sequence generation
— A maximum-likelihood conditional language model given the audio
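In symbols, using the standard attention formulation (after Bahdanau et al.), with encoder states h_t:

$$P(y \mid x) = \prod_{i} P(y_i \mid y_{<i}, c_i), \qquad c_i = \sum_{t} \alpha_{i,t}\, h_t, \qquad \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{t'} \exp(e_{i,t'})}$$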

(Bahdanau et al. 2016) Stanford CS224S Spring 2017
