Presentation 2


End-to-End Automatic Speech Recognition

KUNAL DHAWAN
KUMAR PRIYADARSHI


Meeting 1
End-to-End ASR: online libraries and open-source code
1) ESPnet: end-to-end speech processing toolkit
▪ Based on Chainer and PyTorch
▪ Follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes, providing a complete setup for speech recognition
▪ Paper: https://arxiv.org/pdf/1804.00015.pdf
▪ Fairly recent, so it still has some bugs, but the contributors are active in fixing them
2) Eesen
▪ Based on Kaldi
▪ Acoustic model: bidirectional RNNs with LSTM units
▪ Training: connectionist temporal classification (CTC) as the training objective
▪ Decoding: a principled decoding approach based on weighted finite-state transducers (WFSTs)
▪ Paper: https://arxiv.org/pdf/1507.08240.pdf
▪ Problem: difficult to modify and to try out new things using this library
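
The CTC training objective listed for Eesen can be illustrated with a minimal sketch of the CTC forward recursion, which sums the probability of every frame-level alignment that collapses to the target label sequence. This is a plain-Python illustration, not Eesen's implementation; the function name `ctc_forward_prob`, the input layout, and the choice of index 0 for the blank symbol are assumptions made for this example.

```python
def ctc_forward_prob(probs, labels, blank=0):
    """Total probability of `labels` under CTC, summed over all alignments.

    probs:  list of T frames, each a list of per-symbol probabilities.
    labels: target symbol indices, without blanks.
    """
    # Interleave blanks around the labels: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = probability of all partial alignments that end at ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]         # advance by one position
            # A blank may be skipped only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the last label or on the trailing blank
    return alpha[-1] + (alpha[-2] if S > 1 else 0.0)
```

The training loss for one utterance is the negative logarithm of this value. A practical implementation would work in the log domain to avoid underflow on long utterances.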
3) Kaldi
▪ No current implementation specifically for end-to-end ASR
▪ But Kaldi now offers TensorFlow integration, which means it would be easy to try out our own ideas
Literature Review
• End-to-End Deep Neural Network for Automatic Speech Recognition (2016)
William Song, Jim Cai, Stanford University

▪ Approach
  ▪ CNN for frame-level classification
  ▪ RNN with CTC loss for decoding
  ▪ Traditional hidden Markov model not used
  ▪ Log mel-filterbank features used as input

▪ Results
  ▪ Frame-level classification satisfactory
  ▪ Decoding scheme needs improvement
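
To make the step from frame-level classification to a decoded sequence concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the most likely symbol per frame, collapse consecutive repeats, then drop blanks. This is one simple decoding scheme, not necessarily the one used in the paper; the function name, the symbol table, and the blank index are assumptions made for this example.

```python
def ctc_greedy_decode(frame_scores, symbols, blank=0):
    """Best-path CTC decoding of per-frame scores into a string.

    frame_scores: list of frames, each a list of scores (one per symbol).
    symbols:      list mapping symbol index -> output character.
    """
    # 1) Frame-level classification: pick the highest-scoring symbol per frame
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]

    # 2) Collapse consecutive repeats, 3) drop blanks
    out = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(symbols[idx])
        prev = idx
    return "".join(out)


symbols = ["-", "c", "a", "t"]          # index 0 is the blank
frames = [
    [0.1, 0.7, 0.1, 0.1],               # c
    [0.1, 0.6, 0.2, 0.1],               # c (repeat, collapsed)
    [0.8, 0.1, 0.05, 0.05],             # blank
    [0.1, 0.1, 0.7, 0.1],               # a
    [0.1, 0.1, 0.6, 0.2],               # a (repeat, collapsed)
    [0.1, 0.1, 0.1, 0.7],               # t
]
decoded = ctc_greedy_decode(frames, symbols)   # "cat"
```

Greedy decoding ignores any language model, which is one reason a decoding stage can lag behind good frame-level accuracy.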
Literature Review
• Towards End-To-End Speech Recognition with Deep Convolutional Neural Networks
Zhang et al., Interspeech 2016

▪ Approach
  ▪ CNN for frame-level classification
  ▪ No RNN used at all
  ▪ CTC loss used for decoding
  ▪ Traditional hidden Markov model not used
  ▪ Log mel-filterbank features used as input

▪ Results
  ▪ CNN able to capture temporal relations
  ▪ Training faster compared to RNN models
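
One way to see how a CNN without recurrence can still capture temporal relations is to compute how many input frames each output of a stack of convolutions can see (its receptive field along the time axis). The sketch below uses the standard receptive-field recurrence; the layer configurations are hypothetical and are not taken from the paper.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of stacked 1-D convolutions.

    layers: list of (kernel_size, stride) pairs along the time axis,
            ordered from the input to the output.
    """
    field = 1   # a single output position initially sees one input frame
    jump = 1    # distance in input frames between adjacent positions
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Hypothetical stacks, for illustration only
three_layers = receptive_field([(3, 1), (3, 1), (3, 1)])      # 7 frames
with_stride = receptive_field([(3, 1), (3, 2), (3, 1)])       # 9 frames
```

With a 10 ms frame shift, 7 frames correspond to roughly 70 ms of context, and the span grows with every added layer. All positions in a layer are computed in parallel, whereas an RNN must process frames one after another.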
Literature Review
• End-To-End Speech Recognition from the Raw Waveform (2018)
Zeghidour et al., Facebook A.I.

▪ Approach
  ▪ End-to-end system trained directly from the raw waveform
  ▪ Uses trainable filterbanks in place of log mel-filterbanks
  ▪ Uses a CNN architecture

▪ Results
  ▪ Improved performance over log mel-filterbanks
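
For reference, the fixed log mel-filterbank front end that the trainable filterbanks replace can be sketched as follows: triangular filters spaced evenly on the mel scale are applied to one frame's power spectrum, and the logarithm of each filter's energy is taken. This is a plain-Python illustration using the common formula mel = 2595 log10(1 + f/700); the function names and default values (40 filters, a small `eps`) are assumptions, and the paper's learnable front end is not reproduced here.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def log_mel_energies(power_spectrum, sample_rate, n_filters=40, eps=1e-10):
    """Log mel-filterbank features for one frame.

    power_spectrum: power values for frequency bins from 0 Hz to Nyquist.
    """
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0

    # n_filters + 2 edge frequencies, equally spaced on the mel scale
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]

    # Centre frequency of each spectrum bin in Hz
    bin_hz = [nyquist * k / (n_bins - 1) for k in range(n_bins)]

    feats = []
    for m in range(1, n_filters + 1):
        left, center, right = edges[m - 1], edges[m], edges[m + 1]
        energy = 0.0
        for f, p in zip(bin_hz, power_spectrum):
            if left < f <= center:            # rising slope of the triangle
                energy += p * (f - left) / (center - left)
            elif center < f < right:          # falling slope of the triangle
                energy += p * (right - f) / (right - center)
        feats.append(math.log(energy + eps))
    return feats
```

Here the filter centres and shapes are fixed by hand. In a trainable front end, the corresponding filter parameters are learned jointly with the rest of the network.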
Thank you!
