
CS 224S / LINGUIST 285

Spoken Language Processing

Andrew Maas
Stanford University
Spring 2017

Lecture 8: End-to-end neural network speech recognition
Outline
— ASR discussion thus far
— Connectionist temporal classification (CTC)
— Lexicon-free CTC
— Scaling up end-to-end neural approaches
— Alternative end-to-end approaches
— HW3 discussion



Noisy channel model

$$\hat{W} = \operatorname*{argmax}_{W \in \mathcal{L}} \underbrace{P(O \mid W)}_{\text{likelihood}} \, \underbrace{P(W)}_{\text{prior}}$$


The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source).
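Why the denominator can be ignored: by Bayes' rule the posterior over sources shares the same denominator P(Signal) for every candidate, so the argmax is unchanged:

$$\hat{W} = \operatorname*{argmax}_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \operatorname*{argmax}_{W} P(O \mid W)\,P(W)$$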

[Figure: noisy channel decoding. Candidate source sentences (e.g. "If music be the food of love...", "Every happy family", "In a hole in the ground") are scored against the observed noisy signal, and the decoder guesses the most likely source.]


HMM for the digit recognition task

[Figure: HMM topology for the digit recognition task]


Acoustic Modeling with GMMs

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, … (one state per frame)
Acoustic Model: per-state GMMs model P(x|s), where x is the input feature vector and s the HMM state
Audio Input: feature vectors x1, x2, x3, … (one per frame)

[Figure: GMM densities over the feature space for three HMM states]
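To make the GMM acoustic score concrete, here is a minimal numpy sketch of log P(x|s) for one state's diagonal-covariance mixture (function name and shapes are illustrative, not from the lecture):

import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log P(x | s) under one HMM state's diagonal-covariance GMM.
    x: (D,) feature vector; weights: (K,) mixture weights summing to 1;
    means, variances: (K, D) per-component parameters."""
    log_components = (
        np.log(weights)
        - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        - 0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    )
    # log-sum-exp over the K mixture components
    return np.logaddexp.reduce(log_components)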


DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: use a DNN to approximate P(s|x), producing P(s|x1), P(s|x2), P(s|x3), …; then apply Bayes' rule to recover the likelihood the HMM needs:

P(x|s) = P(s|x) * P(x) / P(s)
       =  DNN   * constant / state prior

Audio Input: Features (x1) Features (x2) Features (x3)
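In code, the hybrid trick is a single subtraction in the log domain; a minimal numpy sketch (names are illustrative):

import numpy as np

def scaled_log_likelihoods(log_posteriors, state_log_priors):
    """Convert DNN outputs into HMM observation scores.
    log_posteriors: (T, S) frame-wise log P(s | x_t) from the network.
    state_log_priors: (S,) log P(s), estimated from alignment counts.
    Returns (T, S) log P(x_t | s) up to a per-frame constant log P(x_t),
    which Viterbi decoding can ignore."""
    return log_posteriors - state_log_priors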


Framework + Isolated Training Limitations

[Figure: frame error rate and RT-03 word error rate versus acoustic model size (HMM-GMM baseline, then DNNs from 36M up to 400M parameters)]

(Maas, Qi, Xie, Hannun, Lengerich, Jurafsky, & Ng. 2017) Stanford CS224S Spring 2017
Recurrent DNN Hybrid Acoustic Models

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: a recurrent DNN estimates P(s|x1), P(s|x2), P(s|x3), …, carrying hidden state across frames
Audio Input: Features (x1) Features (x2) Features (x3)


Deep Recurrent Network

[Figure: Input → Hidden Layer → Hidden Layer → Output Layer, with recurrent connections in the hidden layers]
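A minimal PyTorch sketch of such a stacked recurrent model; the layer sizes and label inventory are illustrative, not the lecture's configuration:

import torch
import torch.nn as nn

class DeepRNN(nn.Module):
    """Stacked recurrent network: per-frame features in, per-frame label
    log-probabilities out (usable for the hybrid or CTC setups here)."""
    def __init__(self, n_features=40, n_hidden=256, n_labels=29):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_labels)  # 2x: both directions

    def forward(self, x):                    # x: (batch, time, n_features)
        h, _ = self.rnn(x)                   # (batch, time, 2 * n_hidden)
        return self.out(h).log_softmax(-1)   # per-frame log P(label | x)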


HMM-Free Recognition

Transcription: Samson
Pronunciation: S – AE – M – S – AH – N
Sub-phones: 942 – 6 – 37 – 8006 – 4422 …

Hidden Markov Model (HMM): state sequence 942, 942, 6, …
Acoustic Model: P(s|x1), P(s|x2), P(s|x3)
Audio Input: Features (x1) Features (x2) Features (x3)

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017


HMM-Free Recognition

Transcription: Samson
Characters: SAMSON
Collapsing function: SS___AA_M_S___O___NNNN → SAMSON

Acoustic Model: use a DNN to approximate P(a|x), the distribution over characters, producing per-frame outputs P(a|x1), P(a|x2), P(a|x3), … (e.g. S, S, _)
Audio Input: Features (x1) Features (x2) Features (x3)

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017
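The collapsing function itself is tiny; a sketch with the blank written as "_":

import itertools

def collapse(path, blank="_"):
    """CTC collapsing: merge runs of repeated symbols, then drop blanks.
    collapse("SS___AA_M_S___O___NNNN") -> "SAMSON"."""
    return "".join(ch for ch, _ in itertools.groupby(path) if ch != blank)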


Example Results (WSJ)
Model output (Hyp) vs. reference (Ref):

Hyp: YET A REHBILITATION CRU IS ONHAND IN THE BUILDING LOOGGING BRICKS PLASTER AND BLUEPRINS FOUR FORTY TWO NEW BETIN EPARTMENTS
Ref: YET A REHABILITATION CREW IS ON HAND IN THE BUILDING LUGGING BRICKS PLASTER AND BLUEPRINTS FOR FORTY TWO NEW BEDROOM APARTMENTS

Hyp: THIS PARCLE GUNA COME BACK ON THIS ILAND SOM DAY SOO
Ref: THE SPARKLE GONNA COME BACK ON THIS ISLAND SOMEDAY SOON

Hyp: TRADE REPRESENTIGD JUIDER WARANTS THAT THE U S WONT BACKCOFF ITS PUSH FOR TRADE BARIOR REDUCTIONS
Ref: TRADE REPRESENTATIVE YEUTTER WARNS THAT THE U S WONT BACK OFF ITS PUSH FOR TRADE BARRIER REDUCTIONS

Hyp: TREASURY SECRETARY BAGER AT ROHIE WOS IN AUGGRAL PRESSED FOUR ARISE IN THE VALUE OF KOREAS CURRENCY
Ref: TREASURY SECRETARY BAKER AT ROH TAE WOOS INAUGURAL PRESSED FOR A RISE IN THE VALUE OF KOREAS CURRENCY


Earlier work on CTC with phonemes

(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
Decoding with a Language Model

Pipeline: character probabilities (e.g. __oo_h__y_e_aa_h) are decoded against a lexicon [a, …, zebra] and a word language model p("yeah" | "oh").

[Figure: bar charts of character error rate and word error rate when decoding with no LM, with a lexicon, and with a bigram LM]

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
Loss functions and architecture

— What function to fit (the loss function):
— HMM-DNN uses independent per-frame classification with forced-alignment hard labels
— CTC is also independent per-frame, but cleverly allows for multiple possible labelings

— How do we approximate that function (the network architecture):
— HMM-DNN is typically fine with just a DNN
— CTC needs a recurrent NN


CTC loss during training

(Graves, Fernández, Gomez, & Schmidhuber. 2006) Stanford CS224S Spring 2017
Recurrence Matters!

Per-frame character outputs (S S _) from P(a|x1), P(a|x2), P(a|x3) over Features (x1), (x2), (x3):

Architecture                   CER
DNN                            22
+ recurrence                   13
+ bi-directional recurrence    10

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
CTC Loss Function

— Maximum log-likelihood training of the transcript
— Intuition: the alignment is unknown, so integrate over all possible time-character alignments
— Example: W = "hi", T = 3. The possible C such that κ(C) = W are: hhi, hii, _hi, h_i, hi_

(Graves & Jaitly. 2014) Stanford CS224S Spring 2017
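The sum over alignments is computed with a dynamic program, the CTC forward algorithm (Graves et al. 2006). A numpy sketch with illustrative names; for W = "hi" and T = 3 it adds up exactly the five paths listed above:

import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC forward algorithm: -log P(target | input), summing over all
    time-level alignments (a sketch; assumes a non-empty target).
    log_probs: (T, V) per-frame log label probabilities.
    target: label indices without blanks, e.g. the indices of "h", "i"."""
    ext = [blank]                      # interleave blanks: hi -> _h_i_
    for label in target:
        ext += [label, blank]
    T, S = len(log_probs), len(ext)
    alpha = np.full((T, S), -np.inf)   # alpha[t, s]: log-prob of prefixes
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                            # stay
            if s > 0:
                a = np.logaddexp(a, alpha[t - 1, s - 1])   # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a = np.logaddexp(a, alpha[t - 1, s - 2])   # skip a blank
            alpha[t, s] = a + log_probs[t, ext[s]]
    # Valid paths end on the last label or the trailing blank.
    return -np.logaddexp(alpha[T - 1, S - 1], alpha[T - 1, S - 2])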


CTC Objective Function

— Labels at each time index are conditionally independent given the input (like HMMs)
— Sum over all time-level labelings consistent with the output label
— Example: output label AB; time-level labelings AB, _AB, A_B, …, _A_B_
— Final objective maximizes the probability of the true labels:
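In standard notation (after Graves & Jaitly 2014), with κ(·) the collapsing function and per-frame outputs conditionally independent given the input X:

$$P(W \mid X) = \sum_{C \,:\, \kappa(C) = W} \; \prod_{t=1}^{T} P(c_t \mid X), \qquad \theta^{*} = \operatorname*{argmax}_{\theta} \sum_{(X,\, W^{*})} \log P(W^{*} \mid X; \theta)$$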

(Graves & Jaitly, ICML 2014) Stanford CS224S Spring 2017


Collapsing Example
Per-frame argmax:
____________________________________________________________________________________________________
yy__ee_________tt_ ____________________________________________a_____
_rr__e________hh__________b___ii_______lll__i_____tt______aa______tt_______iio__n___
___cc_____rrr_u_____________________ ________ii___ss
______________o__________nn_____________hhh_a___________________nnddd ________________i__n___
__thh_e_____ __________________________________________bb_uuii_______lllldd____ii____nng_____
___________________________________l___o___o_g__g___ii____nng______
____b___rr_ii________ck__s__________________________________________p___ll__a________sstt_________eerr__
______a___nnd_ ___b___lll_uu____ee__pp___r___i________nnss_
________________f______oou____________rrr________ _____________f_____oo__rrr__tt_y____
_____t____www_oo__________ ____nn___ew___________________
______________________________________________________b___e_______t__________i____n___
____e________pp_____aa___rr___tt____mm_ee___nnntss
____________________________________________________________________________________________________
_________________________________

After collapsing:
yet a rehbilitation cru is onhand in the building loogging bricks plaster and blueprins four forty two new betin epartments

Reference:
yet a rehabilitation crew is on hand in the building lugging bricks plaster and blueprints for forty two new bedroom
apartments

(Hannun, Maas, Jurafsky, & Ng. 2014) Stanford CS224S Spring 2017
Rethinking Decoding

Out-of-vocabulary and partial words break lexicon-based decoding: syriza, bae, abo--, sof--, schmidhuber.

Word-LM pipeline: character probabilities (__oo_h__y_e_aa_h) → lexicon [a, …, zebra] → word language model p("yeah" | "oh")
Character-LM pipeline: character probabilities (__oo_h__y_e_aa_h) → character language model p(h | o,h,␣,y,e,a), with no lexicon at all

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
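As a stand-in for the character LMs used here (the paper's are n-gram and neural models), a tiny add-one-smoothed character n-gram sketch; everything below is illustrative, and its prob method matches the lm callable in the beam-search sketch that follows:

from collections import defaultdict

class CharNGramLM:
    """Character n-gram LM with add-one smoothing (illustrative only)."""
    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(int)    # (context, char) -> count
        self.context = defaultdict(int)   # context -> count

    def train(self, text):
        padded = " " * (self.n - 1) + text
        for i in range(len(text)):
            ctx, ch = padded[i:i + self.n - 1], padded[i + self.n - 1]
            self.counts[(ctx, ch)] += 1
            self.context[ctx] += 1

    def prob(self, prefix, ch, vocab_size=28):
        ctx = "".join(prefix)[-(self.n - 1):].rjust(self.n - 1)
        return (self.counts[(ctx, ch)] + 1) / (self.context[ctx] + vocab_size)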
Beam Search Decoding
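A sketch of CTC prefix beam search in the spirit of Hannun et al. (2014), with an optional character LM plugged in; names and defaults are illustrative. Each prefix tracks separate probabilities for ending in blank vs. in a character, so paths that collapse to the same prefix are summed exactly:

import numpy as np
from collections import defaultdict

def ctc_prefix_beam_search(probs, alphabet, blank=0, beam_width=8,
                           lm=None, alpha=1.0):
    """probs: (T, V) per-frame character probabilities (V includes blank).
    alphabet: length-V list of characters (the entry at `blank` is unused).
    lm: optional callable lm(prefix, char) -> p(char | prefix);
    alpha weights the LM against the acoustic evidence."""
    beams = {(): (1.0, 0.0)}  # prefix -> (ends-in-blank, ends-in-char) prob
    for t in range(len(probs)):
        nxt = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            for c, p in enumerate(probs[t]):
                if c == blank:  # blank keeps the prefix, marks a boundary
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b + p * (p_b + p_nb), nb)
                    continue
                ch = alphabet[c]
                lm_w = lm(prefix, ch) ** alpha if lm else 1.0
                ext = prefix + (ch,)
                if prefix and ch == prefix[-1]:
                    # A repeated char extends the prefix only across a blank...
                    b, nb = nxt[ext]
                    nxt[ext] = (b, nb + p * p_b * lm_w)
                    # ...otherwise it merges into the same prefix.
                    b, nb = nxt[prefix]
                    nxt[prefix] = (b, nb + p * p_nb)
                else:
                    b, nb = nxt[ext]
                    nxt[ext] = (b, nb + p * (p_b + p_nb) * lm_w)
        beams = dict(sorted(nxt.items(), key=lambda kv: sum(kv[1]),
                            reverse=True)[:beam_width])
    best = max(beams, key=lambda k: sum(beams[k]))
    return "".join(best)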


Lexicon-Free & HMM-Free on Switchboard

[Figure: Switchboard word error rate for HMM-GMM, CTC with no LM, CTC + 7-gram character LM, CTC + NN character LM, and HMM-DNN]

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Example Results (Switchboard), ~19% CER
Model output (Hyp) vs. reference (Ref):

Hyp: i i don'tknow i don't know what the rain force have to do with it but you know their chop a those down af the tr minusrat everyday
Ref: i- i don't kn- i don't know what the rain forests have to do with it but you know they're chopping those down at a tremendous rate everyday

Hyp: come home and get back in to regular cloos aga
Ref: come home and get back into regular clothes again

Hyp: i guess down't here u we just recently move to texas so my wor op has change quite a bit muh we ook from colorado were and i have a cloveful of sweatterso tuth
Ref: i guess down here uh we just recently moved to texas so my wardrobe has changed quite a bit um we moved from colorado where and i have a closet full of sweaters that

Hyp: i don't know whether state lit state hood whold itprove there a conomy i don't i don't know that to that the actove being a state
Ref: i don't know whether state woul- statehood would improve their economy i don't i don't know that the ve- the act of being a state

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing CLMs

[Figure: Switchboard word error rate when decoding with no LM vs. 5-gram, 7-gram, NN 1-hidden-layer, NN 3-hidden-layer, RNN 1-hidden-layer, and RNN 3-hidden-layer character LMs]

All NN models have 5M total parameters.
(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Transcribing Out of Vocabulary Words

Truth: yeah i went into the i do not know what you think of fidelity but
HMM-GMM: yeah when the i don’t know what you think of fidel it even them
CTC-CLM: yeah i went to i don’t know what you think of fidelity but um

Truth: no no speaking of weather do you carry a altimeter slash barometer
HMM-GMM: no i’m not all being the weather do you uh carry a uh helped emitters last brahms her
CTC-CLM: no no beating of whether do you uh carry a uh a time or less barometer

Truth: i would ima- well yeah it is i know you are able to stay home with them
HMM-GMM: i would amount well yeah it is i know um you’re able to stay home with them
CTC-CLM: i would ima- well yeah it is i know uh you’re able to stay home with them

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Comparing Alignments

[Figure: HMM-GMM phone probabilities vs. CTC character probabilities over time]

(HMM slide from Dan Ellis) Stanford CS224S Spring 2017


Learning Phonemes and Timing

— Take all phone segments from the HMM-GMM alignments (e.g. the phone k)
— Align all segments to start at the same time t = 0
— Compute the average CTC character probabilities during the segment (c, e, k)
— The vertical line shows the median end time of the phone segment from the HMM-GMM alignments


Learning Phonemes and Timing

(Maas*, Xie*, Jurafsky, & Ng. 2015) Stanford CS224S Spring 2017
Scaling end-to-end models: Baidu Deep Speech

(Hannun et al. 2014) Stanford CS224S Spring 2017


Deep Speech – Deep RNN

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Batch Norm for RNNs

[Figure: "Normalize" blocks inserted between the stacked recurrent layers]

Slides from Awni Hannun Stanford CS224S Spring 2017
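Deep Speech 2 normalizes the feed-forward (vertical) connections sequence-wise, i.e. over both the batch and time dimensions. A minimal sketch of that operation (training-mode statistics only, without the learned scale and shift):

import torch

def sequence_batch_norm(x, eps=1e-5):
    """Sequence-wise batch norm: normalize each feature over both the
    batch and time dimensions. x: (batch, time, features)."""
    mean = x.mean(dim=(0, 1), keepdim=True)
    var = x.var(dim=(0, 1), unbiased=False, keepdim=True)
    return (x - mean) / torch.sqrt(var + eps)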


Deep Speech – Batch Norm for RNNs

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Hours of speech data

Language Hours
English 12,000
Mandarin 10,000

Where does the data come from?


● Public benchmarks (English)

● Internal manually labelled data (English and Mandarin)

● Captioned videos (English and Mandarin)

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Captioned Video Data Pipeline

1. Download publicly available video + captions.
2. Align captions to video with a CTC model.
3. Segment at regions of silence.
4. Use a simple classifier to throw out very noisy samples.

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Captioned Video Data Pipeline
Align with a model trained with CTC?

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech - Even more data!

Augmentation: noise synthesis, reverb, time-stretching, pitch-shifting, ...

[Figure: Speech + Noise → Noisy Speech]

Slides from Awni Hannun Stanford CS224S Spring 2017
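A minimal numpy sketch of the noise-synthesis branch, mixing a noise clip into speech at a target signal-to-noise ratio (the function and its default are illustrative):

import numpy as np

def add_noise(speech, noise, snr_db=10.0):
    """Mix a noise clip into a speech clip at a target SNR in dB.
    speech, noise: 1-D float arrays of samples."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12   # avoid division by zero
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise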


Deep Speech – Data Parallel GPU Scaling

[Figure: four model replicas (Model 1-4) connected over InfiniBand, sharing weight updates each iteration]

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Data Parallel GPU Scaling

Custom ring all-reduce avoids extraneous copies to CPU memory.

# GPUs   OpenMPI all-reduce (s)*   Custom all-reduce (s)*   Factor speedup
4        55359                     2587                     21.4
8        48881                     2470                     19.8
16       21562                     1393                     15.5

*Measures time spent in all-reduce for a single epoch.

Slides from Awni Hannun Stanford CS224S Spring 2017
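To see why the ring pattern wins, here is a toy simulation of ring all-reduce (reduce-scatter, then all-gather): per-worker traffic stays proportional to the gradient size regardless of the number of workers. All names are illustrative:

import numpy as np

def ring_allreduce(worker_grads):
    """Toy simulation of ring all-reduce over n workers.
    Phase 1 (reduce-scatter): each worker ends up owning the full sum of
    one chunk. Phase 2 (all-gather): owned chunks travel around the ring
    until every worker holds every summed chunk."""
    n = len(worker_grads)
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n))
              for g in worker_grads]
    for step in range(n - 1):  # reduce-scatter
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data  # neighbour accumulates
    for step in range(n - 1):  # all-gather
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data   # neighbour overwrites
    return [np.concatenate(ch) for ch in chunks]

# e.g. ring_allreduce([np.ones(8), 2 * np.ones(8)]) -> two arrays of 3.0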


Deep Speech – Data Parallel GPU Scaling

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Some results

Architecture                                         English (WER)   Mandarin (WER)
5-layer, 1 RNN                                       13.55           15.41
5-layer, 3 RNN                                       11.61           11.85
5-layer, 3 RNN + BatchNorm                           10.56           9.39
9-layer, 7 RNN + BatchNorm + frequency convolution   9.52            7.93

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Deployment

● Bi-directional models give almost a 10% relative boost … but we can’t deploy them.
● ASR latencies for voice search are <50 ms.
● For 3 seconds of audio, we would need to decode 60x faster than realtime!

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Lookahead convolution

Slides from Awni Hannun Stanford CS224S Spring 2017


Deep Speech – Lookahead convolution

For a lookahead of 20 time-steps (about 800 ms into the future):

Model                                English (WER)   Chinese (WER)
Forward only                         18.8            15.7
Forward + Lookahead (+50k params)    16.8            13.5
Bidirectional (+12M params)          15.4            12.8

Slides from Awni Hannun Stanford CS224S Spring 2017
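A PyTorch sketch of the idea: each feature channel takes a learned weighted sum over a small window of future frames, giving a forward-only model limited right context without full bidirectionality (sizes and initialization are illustrative):

import torch
import torch.nn as nn

class LookaheadConv(nn.Module):
    """Lookahead (row) convolution sketch: per-channel filters over
    future timesteps [t, t + context]."""
    def __init__(self, n_features, context=20):
        super().__init__()
        self.context = context
        self.weights = nn.Parameter(torch.randn(n_features, context + 1) * 0.1)

    def forward(self, x):                  # x: (batch, time, features)
        # Zero-pad the future edge so every step sees `context` frames ahead.
        pad = x.new_zeros(x.size(0), self.context, x.size(2))
        xp = torch.cat([x, pad], dim=1)    # (batch, time + context, features)
        out = torch.zeros_like(x)
        for tau in range(self.context + 1):
            out = out + xp[:, tau:tau + x.size(1), :] * self.weights[:, tau]
        return out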


Listen, Attend, and Spell

[Figure sequence: the LAS architecture, a pyramidal recurrent encoder ("listen") feeding an attention-based character decoder ("attend and spell")]

(Chan, Jaitly, Le, & Vinyals. 2015) Stanford CS224S Spring 2017
Attention-based sequence generation
— A maximum-likelihood conditional language model given the audio
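In symbols, using the standard attention formulation (after Bahdanau et al.), with encoder states h_t:

$$P(y \mid x) = \prod_{i} P(y_i \mid y_{<i}, c_i), \qquad c_i = \sum_{t} \alpha_{i,t}\, h_t, \qquad \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{t'} \exp(e_{i,t'})}$$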

(Bahdanau et al. 2016) Stanford CS224S Spring 2017
