BTP Thesis: End-to-End ASR
CERTIFICATE

This is to certify that this thesis is the work of the candidates, carried out for the award of the degree of Bachelor of Technology in the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under my supervision, and that it has not been submitted elsewhere for a degree.

Guide
Date:
Place:
DECLARATION
The work contained in this thesis is our own work under the supervision of the
guides. We have read and understood the “B. Tech./B. Des. Ordinances and
Regulations” of IIT Guwahati and the “FAQ Document on Academic Malprac-
tice and Plagiarism” of the EEE Department of IIT Guwahati. To the best of our
knowledge, this thesis is an honest representation of our work.
Author
Date:
Place:
Acknowledgments
First and foremost, we would like to express our heartfelt gratitude towards Prof. Rohit Sinha for always being there to guide us irrespective of his busy schedule. We learnt a lot from him and will always be grateful to him for giving us the opportunity to work on this ambitious project.
Further, we would like to sincerely thank Mr. Ganji Sreeram for his constant support and help
throughout the project. This thesis would not have been possible without his valuable inputs
and debugging discussions.
Abstract
The goal of an automatic speech recognition system (ASR) is to accurately and efficiently con-
vert a given speech signal into its corresponding text transcription of the spoken words, irre-
spective of the recording device, the speaker’s accent, or the acoustic environment. To achieve
this, several models such as dynamic time warping, hidden Markov models and deep neural net-
works have been proposed over time in literature. These conventional models give rise to very
complicated systems consisting of various sub-modules. An end-to-end (E2E) automatic speech
recognition system greatly simplifies this pipeline by replacing these complicated sub-modules
with a deep neural network architecture employing data-driven linguistic learning methods. As
the target labels are learned directly from speech data, the E2E systems need a bigger corpus
for effective training. In this work, we try to understand the working of these E2E ASR systems and extend their prowess to tasks where a comparatively smaller amount of training data is available. To work on such a problem, the task of code-switching speech recognition is chosen.
Code-switching refers to the phenomenon of switching between two or more languages while
speaking in multilingual communities. In this work, we aim to address two important problems in the code-switching domain using an end-to-end approach. The first problem is the automatic speech recognition of code-switching data, which cannot be directly tackled using conventional approaches due to the multiple languages involved. The second task is the word-level automatic
language identification (LID) in the context of intra-sentential code-switching. All experimen-
tal validations have been performed on the recently created IITG HingCoS (Hindi-English code-switching) corpus. Results for a conventional deep neural network-hidden Markov model based system have also been reported for contrast.
Keywords: end-to-end speech recognition, language identification, code-switching, at-
tention mechanism
Contents
Abstract
Nomenclature
1 Introduction
1.1 Conventional ASR systems
1.1.1 Hidden Markov models
1.1.2 Deep neural network - hidden Markov model
1.1.3 Limitations
1.2 End-to-end ASR systems
1.3 Code switching
1.4 Language identification
1.5 Multi-task learning
1.6 Contribution of this thesis
2 Literature survey
3 Theoretical background
3.1 Connectionist temporal classification
3.2 Sequence to sequence attention mechanism
3.3 Listen, attend and spell
3.3.1 Overview
3.3.2 Encoder (listener)
3.3.3 Decoder (speller)
3.3.4 Learning
3.3.5 Decoding
3.4 Implementation
5 Experimental setup
5.1 Database
5.1.1 Monolingual database - TIMIT
5.1.2 Code-switching database - HingCoS
5.2 Evaluation metric
5.3 System description
5.3.1 Monolingual experiments
5.3.2 Code-switching experiments
5.3.3 Experimental setup for LID
8 List of publications
List of Figures
4.1 The top two rows show the default Hindi and English character sets, respectively. The proposed reduced target labels covering both Hindi and English sets are shown in the bottom row.
4.2 Sample examples of the proposed common phone-level labelling and the existing character-level labelling schemes for E2E ASR system training.
4.3 Creation of character-level LID tags for the training data towards conditioning the E2E networks to perform the LID task on code-switching speech.
5.1 Sample code-switching sentences in the HingCoS corpus and their corresponding English translations.
6.1 Visualization of the attention mechanism for the LID task on a Hindi-English code-switching utterance.
6.2 Visualization of the attention mechanism for the LID task - second example.
Nomenclature
E2E End-to-End
Chapter 1
Introduction
Automatic speech recognition aims at enabling devices to correctly and efficiently identify spo-
ken language and convert it into text. Some of the most important applications of speech recog-
nition include speech-to-text processing, audio information retrieval, keyword search and gen-
erating streaming captions in videos. From a technology viewpoint, speech recognition has
evolved with several waves of novel innovations. Traditional general-purpose speech recog-
nition systems are based on hidden Markov models. These, when combined with deep neural networks, led to major improvements in various components of the speech recognition pipeline.
1.1 Conventional ASR systems

1.1.1 Hidden Markov models

The hidden Markov model (HMM) is an important framework for the construction of speech recognition systems because speech has temporal structure and can be efficiently encoded as a sequence of spectral vectors spanning the audio frequency range. A conventional ASR pipeline consists of a feature extraction module (typically Mel-frequency cepstral coefficients are used as features), a hidden Markov model and Gaussian mixture model (HMM-GMM) based acoustic model, which captures how a word is pronounced, and an n-gram language model, which helps ascertain which sequences of words can possibly occur in a given language. This pipeline is graphically depicted in Figure 1.1.
Figure 1.1: Block diagram of the conventional ASR pipeline: feature extraction produces feature vectors that the decoder combines with the acoustic model, pronunciation dictionary, and language model to output words.
1.1.2 Deep neural network - hidden Markov model

The HMM has traditionally been combined with the Gaussian mixture model (GMM) for acoustic modelling in speech recognition, where the HMM models the probability of going from one acoustic state (typically a triphone) to another and the GMM models the probability of occurrence of that particular acoustic state. However, it has been experimentally observed that using deep neural networks instead of GMMs to model the (emission) probability of an acoustic state leads to better results; hence, deep neural network-hidden Markov model (DNN-HMM) systems are now widely used for acoustic modelling in speech recognition.
A DNN is a feedforward artificial neural network with several hidden layers, which enables it to learn complex representations. Each hidden unit in a layer uses a non-linear function such as tanh to map the features from the previous layer to the next layer. The outputs of the DNN are then fed to the HMM.
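To make the hybrid setup concrete, the following is a minimal NumPy sketch (not the thesis code; layer sizes and the number of HMM states are illustrative assumptions) of such a DNN: tanh hidden layers map an acoustic feature vector to softmax posteriors over HMM states, which a hybrid system would then typically convert to scaled likelihoods for the HMM by dividing by the state priors.

```python
# A minimal NumPy sketch (not the thesis code) of the DNN used in a hybrid
# DNN-HMM system: tanh hidden layers mapping an acoustic feature vector to
# posterior probabilities over HMM states (senones). Layer sizes are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn(feat_dim=40, hidden=(1024, 1024, 1024), n_states=2000):
    """Randomly initialised weights; in practice these are learned by backprop."""
    sizes = (feat_dim,) + hidden + (n_states,)
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def dnn_posteriors(x, params):
    """Forward pass: tanh hidden layers, softmax output P(state | frame)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

params = init_dnn()
frame = rng.normal(size=40)            # one (spliced) feature vector
post = dnn_posteriors(frame, params)   # posteriors over HMM states
print(post.shape, post.sum())          # (2000,) 1.0
```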
Let us now perform a mathematical analysis of an ASR system and understand the impor-
tance of each of the stages as discussed in the pipeline. Automatic speech recognition deals with
mapping of speech from a $T$-length input feature sequence, $X = (\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T)$, to an $N$-length word sequence, $W = (w_n \in \mathcal{V} \mid n = 1, \dots, N)$. Here, $\mathbf{x}_t$ is a $D$-dimensional speech feature vector at frame $t$, and $w_n$ is a word in the vocabulary $\mathcal{V}$. We mathematically formulate
the ASR problem using Bayes decision theory; the task thus reduces to the estimation of the most probable word sequence $\hat{W}$ among all possible word sequences $\mathcal{V}^*$:
$$\hat{W} = \underset{W \in \mathcal{V}^*}{\arg\max}\; p(W \mid X)$$
So, the ASR problem reduces to obtaining the posterior distribution $p(W \mid X)$. Assuming a hybrid DNN-HMM model, we use the HMM state sequence $S = (s_t \in \{1, \dots, J\} \mid t = 1, \dots, T)$, where $J$ is the total number of HMM states, to factorize $p(W \mid X)$ as follows:
$$\hat{W} = \underset{W \in \mathcal{V}^*}{\arg\max} \sum_{S} p(X \mid S, W)\, p(S \mid W)\, p(W) \tag{1.1}$$
The three distributions $p(X \mid S)$, $p(S \mid W)$ and $p(W)$ represent the acoustic, lexicon and language models of the conventional ASR pipeline, respectively.
1.1.3 Limitations
This conventional DNN-HMM based ASR system is a fairly complicated system consisting of various sub-modules dealing with separate acoustic, pronunciation and language models (as seen in the above derivation). Listed below are some factors that make this method sub-optimal with regard to speech recognition performance:
1. Many module-specific processes are required for efficient working of the final model.
2. Curating a pronunciation lexicon and defining phoneme sets for a particular language requires expert linguistic knowledge and is time-consuming.
4. All the different modules are optimized separately with different objectives, which re-
sults in a sub-optimal model as a whole as the individual modules are not optimized to
match the constraints of other sub-modules. In addition, the training objectives and final
evaluation metrics are very different from each other.
Figure 1.2: Block diagram of E2E networks using: a) CTC mechanism, and b) attention mechanism.
1.3 Code switching

It has been observed that people use words of a foreign language while conversing in their native tongue so as to effectively communicate with other people [11, 12].
The recent spread of urbanization and globalization have positively impacted the growth of
bilingual/multilingual communities and hence made this phenomenon more prominent. The
growth in such communities has made automatic recognition of code-switching speech an im-
portant area of interest [13–15]. In India, Hindi is the native language of around 50% of its
1.32 billion population [16]. A large portion of the remaining half, especially those residing
in metropolitan cities, understand the Hindi language well enough. Due to its prominence in administration, law and the corporate world, the English language is also used by around 125 million people in India. Thus, Indians naturally tend to use some English words within their Hindi discourse, which is referred to as Hindi-English code-switching [17, 18]. Despite the increasing code-switching phenomenon, the research activity in this area is somewhat limited due to the lack of resources, especially for building robust code-switching ASR systems. A few examples of different code-switching varieties are presented in Table 1.1. Note that the Type-1 and Type-2 categories
refer to high and low contextual information carried by the non-native language (here English),
respectively. Note that the experiments in this work have been carried out on intra-sentential
code-switching speech corpus.
Table 1.1: Sample Hinglish sentences showing the different varieties of code-switching data.

Hindi:
- कृपया मुझे मेरा चालू खाता शेष राशि बताएँ
- कार्यों के नाम भी छोटे अक्षर से ही शुरू होते हैं
- मेरा atm कार्ड खो गया है तो मैं अपने भुगतान को कैसे रोक सकता हूँ

English:
- can you tell me the departure time of deccan queen
- please tell me my current account balance
- the names of the functions also start with a lowercase letter
- my atm card is lost so how can I stop my payment

Inter-sentential:
- she is the daughter of ceo, वह यहाँ दो दिन के लिए आई है
- मुझे अमेरिका में चार साल हो गए, but I still miss my country

Intra-sentential (Type-1):
- मुझे मेरा current account balance जानना है
- भारत में popular free virtual credit card services कितनी हैं

Intra-sentential (Type-2):
- अपने budget के अनुसार investments कर सकते हैं
- class और object के बीच relationship क्या है
1.4 Language identification
The task of detecting the languages present in spoken or written data using machines is referred to as language identification (LID). It finds applications in many areas including the automatic recognition of code-switching speech. Traditionally, LID has been performed by employing separate large vocabulary continuous speech recognisers (LVCSRs) [19]. In this approach, separate LID systems are built independently for each of the languages present in the data, and the task of each system is to predict which of the words in a given sentence belong to that particular language. In this work, we aim to develop an LID system that can identify the code-switching instances directly instead of separately modelling the underlying languages, i.e. a joint LID system.
1.5 Multi-task learning
In conventional deep learning, neural networks are generally optimized for a single metric. In order to achieve this, we train a single network or an ensemble of networks to achieve our goal. This approach has a slight downside in that we ignore information that could help us perform better on the given task. Specifically, we can obtain this information by training on the signals of related tasks. By sharing representations between related tasks, we can enable our model to perform better on the original task. This paradigm is commonly referred to as multi-task learning (MTL). A neural network architecture illustrating the MTL paradigm is shown in Figure 1.3.
Figure 1.3: A multi-task learning network in which shared layers process the input and feed multiple task-specific outputs (Output A, Output B, Output C).
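The following is a small NumPy sketch of the idea in Figure 1.3 (illustrative only; the layer sizes, the two hypothetical tasks and the weighting factor are assumptions, not the thesis configuration): a shared layer feeds two task-specific heads, and a single weighted loss couples the tasks.

```python
# A small NumPy sketch (illustrative only, not the thesis code) of the
# multi-task idea in Figure 1.3: a shared representation feeds several
# task-specific output heads, and a single weighted loss is optimized.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40))              # a batch of input features

W_shared = rng.normal(0, 0.1, (40, 64))   # shared layers (here just one)
W_task_a = rng.normal(0, 0.1, (64, 10))   # head for task A (e.g. ASR targets)
W_task_b = rng.normal(0, 0.1, (64, 3))    # head for task B (e.g. LID targets)

h = np.tanh(x @ W_shared)                 # shared representation
logits_a, logits_b = h @ W_task_a, h @ W_task_b

def xent(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

labels_a = rng.integers(0, 10, size=8)
labels_b = rng.integers(0, 3, size=8)

lam = 0.5                                 # task-weighting hyperparameter (assumed)
loss = lam * xent(logits_a, labels_a) + (1 - lam) * xent(logits_b, labels_b)
print(loss)                               # both tasks shape the shared layer's gradient
```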
1.6 Contribution of this thesis

The major contributions of this work are:

1. Building and comparing the performances of two types of E2E ASR systems with conventional DNN-HMM systems on two distinct ASR tasks, namely monolingual ASR and code-switching ASR.
2. Proposal and experimental validation of a novel target set reduction scheme for E2E speech recognition of code-switching data.
3. Proposal and experimental validation of a novel joint modelling based LID system for code-switching speech.
We are pleased to report that the experimental findings of this thesis have been submitted as two
papers in Interspeech-2019. The details of the same have been presented in Chapter 8.
Chapter 2
Literature survey
The earliest technological innovation that fueled the rise of E2E systems is CTC. It was proposed by Graves et al. [3] as a way to train an acoustic model without requiring frame-level alignments, as is the case with conventional HMM-based models. Before the advent of CTC, speech recognition models based on recurrent neural networks (RNNs) required pre-aligned and segmented data for efficient training, in addition to extensive post-processing. CTC alleviated these limitations and allowed unsegmented speech data to be labelled directly. We look at CTC in more detail in Chapter 3.
In a new wave of development of E2E systems, Graves and Jaitly came up with a complete CTC-based E2E ASR system [5]. They proposed a system with character-based CTC which directly outputs word sequences given input speech, without requiring an intermediate phonetic representation. The architecture utilized deep BLSTM units and the CTC objective function for training. In addition, they proposed to modify the objective function by introducing a new loss function (namely the transcription loss) that directly optimizes the word
error rate even in the absence of a lexicon or language model. The proposed system obtained
a word error rate of 8.2 % on the popular WSJ corpus as opposed to the best result (8.7 %)
obtainable using conventional systems. This work successfully demonstrated that it is possi-
ble to directly produce character transcriptions from the speech data using RNNs with minimal
preprocessing and in absence of a phonetic representation.
As an extension of their previous work, Graves et al. [20] came up with a deep RNN
architecture to explore the effectiveness of learning deep representations of speech data which
had earlier proved to be useful for computer vision tasks. They proposed augmenting a CTC-
based model with a recurrent language model component with both the blocks being trained
jointly on the acoustic data. The architecture showed promising results for the TIMIT phoneme
recognition task and the advantages of deep RNNs in modelling speech became immediately
obvious, but the work did not get as much traction as CTC.
In addition to CTC based models, another major class of architecture prevalent in the
ASR community is the encoder-decoder based model. Attention-based encoder-decoder models
emerged first in the context of neural machine translation. The initial applications of attention-
based models to ASR are found in Chan et al. [21] and Chorowski et al. [6]. In these architectures, the encoder plays the role of the acoustic model from the conventional ASR pipeline, as it transforms the input speech into a higher-level representation; the attention module is analogous to the alignment model, as its role is to identify the encoded frames that are relevant to producing the current output; and
the decoder is equivalent to the pronunciation and language models as it operates autoregres-
sively by predicting each output as a function of the previous predictions. These architectures
are explained in detail in section 3.2. Listen, Attend and Spell [21] is a popular example of the
encoder-attention-decoder architecture and is hence discussed further in section 3.3.
In a more recent development, Watanabe et al. proposed a novel architecture that combines CTC and attention to learn better representations and speed up the convergence of attention under the paradigm of multi-task learning. They proposed including the CTC cost function as an auxiliary task, with attention as the main task, to learn a shared-encoder representation. The multi-task framework allows simultaneous optimization of both cost functions. The model significantly outperformed both the CTC and the attention models individually, while enabling faster convergence of attention. This paves a new way for future research in ASR and
enables adaptation of diverse architectures in a single unified model as we can define domain
dependent sub-tasks with appropriate loss functions under the bigger umbrella of an E2E system
and jointly optimize the main and sub-tasks.
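As a rough sketch of this framework, and using an interpolation weight $\lambda$ that is our own notation rather than a value taken from that work, the joint objective can be written as

$$\mathcal{L}_{\mathrm{MTL}} = \lambda\, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{attention}}, \qquad 0 \le \lambda \le 1.$$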
Let us now look at the phenomenon of code switching. It has been observed that people
use words of a foreign language while conversing in their native tongue so as to effectively
communicate with other people [11, 12]. The recent spread of urbanization and globalization
have positively impacted the growth of bilingual/multilingual communities and hence made this
phenomenon more prominent. The growth in such communities has made automatic recogni-
tion of code-switching speech an important area of interest [13–15]. In India, Hindi is the native
language of around 50% of its 1.32 billion population [16]. A large portion of the remaining
half, especially those residing in metropolitan cities understand the Hindi language well enough.
Due to prominence in administration, law and corporate world, English language is also used
by around 125 million people in India. Thus, Indians naturally tend to use some English words
within their Hindi discourse, which is referred to as Hindi-English code-switching [17, 18]. Despite the increasing code-switching phenomenon, the research activity in this area is somewhat limited due to the lack of resources, especially for building robust code-switching ASR systems. E2E
systems are fast becoming the norm for ASR task. Unlike the conventional systems, the E2E
systems directly model the output labels given the acoustic features. This is usually achieved
by employing two techniques: (i) CTC [3, 4], and (ii) sequence to sequence modelling with at-
tention mechanism [5–8]. Conventionally, the E2E systems have been trained for characters as
output labels which simplifies the process of data preparation. In [9], it is shown that grapheme-
based E2E ASR systems slightly outperform phoneme-based systems when a large amount of
data (>12,500 Hrs) is used for training.
LID finds applications in many areas including automatic recognition of code-switching
speech. In [19], the authors developed an LID system for code-switching speech by employ-
ing separate large vocabulary continuous speech recognisers (LVCSRs). Recently, researchers
have explored E2E networks in many speech/text processing applications. The E2E networks
can be trained by employing two techniques: (i) CTC [3], and (ii) seq2seq with attention
mechanism [6]. Current literature amply demonstrates that the attention-based E2E systems
outperform the CTC-based E2E systems. Recently, an utterance-level LID system employing the attention mechanism has been explored [22]. In that work, a set of pre-trained language-category embeddings is used as a look-up table for producing the attentional vectors.
Chapter 3
Theoretical background
As mentioned earlier, the traditional methods for speech recognition require separate sub-modules which are independently trained and optimized. For example, the acoustic model takes acoustic features as input and predicts a set of phonemes as output. The pronunciation model maps each sequence of phonemes to the corresponding word in the dictionary. Finally, the language model assigns probabilities to word sequences and determines whether a sequence of words is probable or not. E2E systems aim at learning all these components jointly as a single system with the aid of
a sufficient amount of data [23]. CTC and sequence-to-sequence (seq2seq) attention are the two
main paradigms around which the entire E2E ASR revolves. These approaches have advantages
in terms of training and deployment in addition to the modeling advantages which we describe
in this section.
Figure 3.1: Recurrent neural network architecture (image adapted from [1])
3.1 Connectionist temporal classification

Recurrent neural networks (RNNs) have been widely used in language modeling, natural language processing and machine translation tasks. In the
area of speech recognition, RNNs are typically trained as frame classifiers, which then require a
separate training target for every frame. HMMs have been traditionally used to obtain an align-
ment between the input speech and its transcription. CTC is an idea that makes the training for
sequence transcription tasks possible for an RNN while removing the need of prior alignment
between input and output sequences. While using CTC, the output layer of the RNN contains one unit for each of the transcription labels (for example, phonemes), in addition to one extra unit referred to as the ‘blank’, which corresponds to the emission of ‘nothing’, i.e. a null emission.
For a given training speech example, there are as many possible alignments as there are ways
of separating the labels with blanks. For example, if we use φ to denote a null emission, the
alignments (φ c φ a t) and (c φ a a a φ φ t t) correspond to the same transcription ‘cat’. While
decoding, we remove the labels that repeat in successive time-steps. At every time-step, the
network decides whether to emit a symbol or not. As a result, we obtain a distribution over all
possible alignments between the input and target sequences. Finally, CTC employs a dynamic
programming based forward-backward algorithm to obtain the sum over all possible alignments
and produces the probability of output sequence given a speech input. CTC is not very efficient
with respect to modelling sequences end-to-end, as it assumes that the outputs at various time-steps are conditionally independent of each other. As a result, it is incapable of learning a language model on its own. However, it enables the RNN to learn the acoustic and pronunciation models jointly and omits the HMM/GMM construction step.
The output layer in CTC contains a single unit for each of the transcription labels (characters /
phonemes) in addition to the blank. More formally, let the length of an input speech sequence
x be T and the output vectors yt be normalized using softmax. We interpret these as the proba-
bility of outputting a particular label with index k at a time t:
$$P(k, t \mid x) = \frac{\exp(y_t^k)}{\sum_{k'} \exp(y_t^{k'})}$$
As mentioned earlier, for a given transcription sequence, there are many possible align-
Figure 3.2: Architecture of the CTC-based E2E network. The encoder is a deep network con-
sisting of BLSTMs.
ments on account of ‘blank’ insertions. Also, if the same label appears on successive time steps
in an alignment, the repeats are removed. For example, (c c a a a t) will also correspond to
‘cat’. Let B denote an operator that removes the repeated labels and then, the blanks from
alignments. The total probability of an output transcription y is the sum of the probabilities of
the alignments that correspond to it. So,
$$P(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x)$$
This summing over all possible alignments allows the architecture to be trained without requiring prior alignments. Given a target transcription $y^*$, the network is trained to minimize the CTC cost function:
$$\mathrm{CTC}(x) = -\log P(y^* \mid x)$$
CTC provides the sequence of phonemes, predicting a phoneme only when it is very sure of its presence and outputting the ‘blank’ symbol for the rest of the frames. Traditional approaches depend heavily on the frame alignment and give poor results even with slight misalignment, whereas CTC overcomes this obstacle and thus provides a more robust way of training an RNN for sequence transcription tasks without requiring any prior alignment information.
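The following toy Python sketch (illustrative only, and deliberately brute-force) mirrors the description above: the operator B collapses an alignment by merging repeats and dropping blanks, and P(y|x) is obtained by summing over every alignment that collapses to y. A practical implementation would use the forward-backward recursion instead of enumerating alignments.

```python
# A toy Python sketch of the CTC ideas above (not an efficient implementation):
# the collapsing operator B removes repeated labels and then blanks, and
# P(y|x) is obtained by summing frame-wise probabilities over all alignments
# that collapse to y.
import itertools
import numpy as np

BLANK = 0  # index of the blank symbol

def collapse(alignment):
    """Operator B: merge repeats, then drop blanks."""
    merged = [a for i, a in enumerate(alignment) if i == 0 or a != alignment[i - 1]]
    return tuple(a for a in merged if a != BLANK)

def ctc_prob(y, probs):
    """P(y|x) for per-frame label posteriors probs of shape (T, K)."""
    T, K = probs.shape
    total = 0.0
    for alignment in itertools.product(range(K), repeat=T):  # all K^T alignments
        if collapse(alignment) == tuple(y):
            total += np.prod([probs[t, alignment[t]] for t in range(T)])
    return total

# Tiny example: 4 frames, labels {blank=0, 'c'=1, 'a'=2, 't'=3}.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=4)       # each row sums to 1
print(ctc_prob([1, 2, 3], probs))               # probability of the sequence 'c a t'
print(-np.log(ctc_prob([1, 2, 3], probs)))      # the corresponding CTC loss
```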
3.2 Sequence to sequence attention mechanism
In sequence-to-sequence (seq2seq) learning, an RNN is trained to map an input sequence to an
output sequence which is not necessarily of the same length. It has been used in sequence
transduction tasks like neural machine translation, image captioning and automated question
answering. As the nature of speech and the transcriptions is sequential, seq2seq is a viable
proposition for speech recognition as well. In this paradigm, the seq2seq model (Figure 3.3)
attempts to learn to map the sequential variable-length input and output sequences with an en-
coder RNN that maps the variable-length input into a fixed length context vector representation.
This vector representation is used by a decoder to output the sequences which can vary in length.
An attention mechanism (Figure 3.4) can be used to produce a sequence of vectors from
the encoder RNN for each time-step of the input sequence. These indicate which adjacent time frames the system should attend to at any particular instant in order to make a decision. The decoder learns to pay selective attention to these vectors to produce the output at each time-step. In contrast to the simple seq2seq model, the attention vector is used to pass information through the network at every input step. Originally inspired by human visual attention, the attention mechanism is a dominant technique in computer vision for visual recognition tasks, where the neural network focuses on a specific region of the image in ‘high resolution’ while perceiving the rest of the image in ‘low resolution’. It has been successfully applied to object recognition and image captioning tasks.
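A minimal NumPy sketch of this attention step is given below (dot-product scoring is assumed here purely for illustration; it is not necessarily the scoring function used elsewhere in this thesis): the decoder state is scored against every encoder output, the scores are normalised into attention weights, and the context vector is the weighted sum of encoder outputs.

```python
# A minimal NumPy sketch (illustrative, not the thesis code) of the attention
# step in a seq2seq model.
import numpy as np

rng = np.random.default_rng(0)

def attention_context(decoder_state, encoder_outputs):
    """Dot-product attention: returns (context, weights)."""
    scores = encoder_outputs @ decoder_state          # (T,)
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()   # attention weights over time
    context = weights @ encoder_outputs               # (D,) weighted sum
    return context, weights

T, D = 50, 128                                        # encoded frames, state size (assumed)
encoder_outputs = rng.normal(size=(T, D))             # h = Listen(x)
decoder_state = rng.normal(size=D)                    # s_i
context, weights = attention_context(decoder_state, encoder_outputs)
print(context.shape, weights.sum())                   # (128,) 1.0
```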
Attention mechanism is different from CTC in the sense that it does not assume the outputs
to be conditionally independent of each other. As a result, it can learn a language model on its own, in addition to the acoustic and pronunciation models, jointly in a single network, aided by the availability of large amounts of speech data.

Figure 3.4: A seq2seq model with attention
3.3 Listen, attend and spell

3.3.1 Overview

The E2E LAS architecture takes acoustic features as inputs, and the output of the network is a sequence of characters/phonemes depending on the way it is trained. Let $x = (x_1, \dots, x_T)$ be the input sequence of filterbank spectral features, and let $y = (\langle\mathrm{sos}\rangle, y_1, \dots, y_S, \langle\mathrm{eos}\rangle)$ be the output sequence of characters/phonemes.
Figure 3.5: Architecture of LAS network. It consists of three modules namely: listener (en-
coder), attender (alignment generator), and speller (decoder).
Each character output $y_i$ is modeled as a conditional distribution over the previous characters $y_{<i}$ and the input sequence $x$ using the chain rule:
$$P(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})$$
The LAS architecture is composed of two sub-modules: the listener and the speller. The listener
is the acoustic encoder that transforms the original input signal x into a higher level represen-
tation h. The AttendAndSpell function takes h as input and produces a probability distribution
over character/phoneme sequences:
$$h = \mathrm{Listen}(x)$$
$$P(y \mid x) = \mathrm{AttendAndSpell}(h, y)$$
3.3.2 Encoder (listener)
The listener network (Figure 3.5) employs a bidirectional long short-term memory (BLSTM) encoder with a pyramidal structure to reduce the length of the encoded sequence. The pyramidal structure is needed because an utterance can contain thousands of frames, which leads to slow convergence. The pyramidal BLSTM reduces the number of time-steps by a factor of two for each layer added, which lowers the computational complexity and helps the attention mechanism converge faster.
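The time reduction itself can be sketched in a few lines of NumPy (shapes are assumed for illustration): consecutive outputs of a layer are concatenated in pairs before being passed to the next BLSTM layer, halving the number of time-steps at every level of the pyramid.

```python
# A short sketch (assumed shapes, not the thesis code) of the pyramidal step
# used in the listener: frame pairs are concatenated, halving the time axis.
import numpy as np

def pyramid_step(h):
    """h: (T, D) layer outputs -> (T // 2, 2 * D) by concatenating frame pairs."""
    T, D = h.shape
    T = T - (T % 2)                      # drop an odd trailing frame
    return h[:T].reshape(T // 2, 2 * D)

h = np.random.default_rng(0).normal(size=(1000, 256))   # 1000 frames in, 256-dim
for _ in range(3):                                       # three pyramidal layers
    h = pyramid_step(h)                                  # a BLSTM would follow each step
print(h.shape)                                           # (125, 2048): 8x fewer time-steps
```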
3.3.3 Decoder (speller)

The AttendAndSpell network (Figure 3.5) uses an attention-based LSTM transducer. At each output time step, this network produces a distribution over the next character, conditioned on all the characters emitted before.
Let the decoder state at time step $i$ be $s_i$, which is a function of the previous state $s_{i-1}$, the previous output $y_{i-1}$ and the previous context $c_{i-1}$. The context vector $c_i$ is the output of the attention mechanism:
$$c_i = \mathrm{ContextForAttention}(s_i, h)$$
$$s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1})$$
3.3.4 Learning
The Listen(.) and AttendAndSpell(.) functions are trained jointly to maximize the following
log probability:
$$\max_{\theta} \sum_{i} \log P(y_i \mid x, y^*_{<i}; \theta)$$
3.3.5 Decoding
After training the LAS model, we can extract the most likely character sequence given the input acoustic features:
$$\hat{y} = \underset{y}{\arg\max}\; \log P(y \mid x)$$
For decoding, a left-to-right beam search is used, as is common in machine translation tasks.
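A toy beam-search sketch is shown below (illustrative only; the thesis relies on the decoder provided by the Nabu toolkit, and the scoring function here is a random stand-in for the speller's distribution): at each step every live hypothesis is extended with all candidate labels, and only the best-scoring beam_width hypotheses are kept.

```python
# A toy left-to-right beam search sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS = 6, 0                              # toy vocabulary, index 0 = <eos>

def next_log_probs(prefix):
    """Stand-in for log P(y_i | x, y_<i): a random but valid log-distribution."""
    logits = rng.normal(size=VOCAB)
    return logits - np.logaddexp.reduce(logits)

def beam_search(beam_width=4, max_len=10):
    beams = [((), 0.0)]                        # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:   # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            logp = next_log_probs(prefix)
            for tok in range(VOCAB):
                candidates.append((prefix + (tok,), score + logp[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search())                           # best hypothesis and its log-probability
```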
3.4 Implementation
TensorFlow [24] is a deep learning toolbox from Google which aids the implementation of deep learning models. In particular, we use an open-source ASR toolkit, Nabu [25], which uses TensorFlow as its backend. The toolkit works in several stages: data preparation, training, and finally testing and decoding. It supports the essential building blocks, namely CNN + BLSTM or pyramidal BLSTM based encoders, dot-product/location-aware attention, LSTM/RNN decoders, and Kaldi-style (a popular conventional ASR modelling toolkit) data processing, feature extraction/formats, and recipes. This helps us design architectures of our choice and easily experiment with various hyperparameters of our models. Working with deep learning models requires large computational resources. For implementing and training our architectures, we used an NVIDIA Quadro GPU with 2 GB of memory aided by 128 GB of RAM. Deep learning libraries tend to throw compatibility issues because of version mismatches; in particular, TensorFlow version 1.8.0 has been used for performing the experiments.
Chapter 4
In this chapter, we explore E2E approaches for tackling two important problems in the code-
switching domain, i.e. automatic speech recognition and language identification. We first
present the conventional unified character set approach and highlight the problems with the
same. Next, our proposed target set reduction scheme is presented in detail. This is followed by
a detailed discussion on the proposed joint intra-sentential language identification scheme.
Hindi characters: अ आ इ ई उ ऊ ऋ ए ऐ ओ औ क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह क़ ख़ ग़ ज़ ड़ ढ़ फ़ ा ि ी ु ू ृ े ै ो ौ ॉ ् ँ ं ः

English characters: a b c d e f g h i j k l m n o p q r s t u v w x y z

Common phone set: a aa i ii u uu rq ee ei o ou k kh g gh ng c ch j jh nj tx txh dx dxh nx t th d dh n p ph b bh m y r l w sh sx s h kq khq gq z jhq dxq dxhq f q hq mq ao ae au ai er ng oy

Figure 4.1: The top two rows show the default Hindi and English character sets, respectively. The proposed reduced target labels covering both Hindi and English sets are shown in the bottom row.
In the unified character set approach, the character sets of the underlying languages are simply combined. This expands the target set for the E2E ASR task even though the amount of data available for training remains the same. Thus, comparatively less data is eventually available to learn a reliable representation for each of the characters in the target set. As the amount of data available for training cannot be increased due to availability constraints, we need to devise intelligent models which can perform well even in the absence of a huge amount of training data. Inspired by the acoustic similarity of the languages involved in code-switching data, and the fact that E2E ASR systems basically try to learn the probability of emission of a target label given acoustic data, we propose a novel target set reduction scheme for code-switching ASR which is presented in the next section.
Towards addressing those constraints, we explore a target set reduction scheme by exploiting
the acoustic similarity in the underlying languages of the code-switching task. This scheme
is primarily intended to enhance the performances of code-switching E2E ASR systems. The
validation of the proposed idea has been done on Hindi-English code-switching task using both
E2E network and hybrid deep neural network based hidden Markov model (DNN-HMM).
Figure 4.2: Examples of the proposed common phone-level labelling and the existing character-level labelling schemes for E2E ASR system training. Note that for the given words, the unique targets number 22 when tokenized at the character level and 12 when tokenized using the proposed scheme.

| Word | Character-level tokens | Common phone-level tokens |
| bharat | b h a r a t | bh aa r a t |
| भारत | भ ◌ा र त | bh aa r a t |
| english | e n g l i s h | i ng l ii sh |
| अंग्रेज़ी | अ ◌ं ग ◌् र ◌े ज ◌ी | a ng r ei j ii |
Despite the expansion of the target set in the case of code-switching E2E ASR, the phone sets
corresponding to the underlying languages may have significant acoustic similarity. This fact is
well known and has been exploited in the creation of a common phone set across languages [26].
Motivated by that, we propose a scheme for target set reduction in code-switching E2E ASR task
by creating common target labels based on acoustic similarity. In the following, the proposed
scheme has been explained in detail in the context of Hindi-English code-switching ASR task
which has been used in this work for experimentation. In principle, it can be extended to any
other code-switching context as well.
The Hindi and English languages comprise 68 and 26 characters, respectively. For reference, those are shown in the top two rows of Figure 4.1. In [26], a composite phone set covering major Indian languages is proposed in the context of computer processing. Along similar lines, a phone set for English has been defined. Combining the labels for Hindi and English, a common phone set comprising 62 elements is derived; it is shown in the bottom row of Figure 4.1. Using this common phone set, a dictionary keeping the default pronunciations for
all Hindi and English words in the HingCoS corpus (described in detail in Chapter 5) is created.
Now, the targets for acoustic modelling are derived by taking the pronunciation breakup of all
Hindi and English words. A few example words along with their default character-level and the
proposed common phone-level tokenizations are shown in Figure 4.2. It can be observed that
the considered Hindi/English words lead to 22 unique targets when tokenized at the character
level and 12 unique targets when tokenized using the proposed scheme. For the Hindi-English
code-switching task, the proposed approach results in a 34% reduction in the size of the target
set. The importance of this reduction is enhanced by the fact that the availability of code-switching data is still limited. With a smaller target set, the labels can now be learnt more reliably even from a smaller amount of data. This has a significant impact on the performance of the corresponding ASR systems. The experimental results discussed later in Chapter 6 support this argument.
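The mapping itself is straightforward, as the hedged sketch below illustrates; the tiny lexicon holds only the example entries of Figure 4.2 and stands in for the full HingCoS pronunciation dictionary.

```python
# A hedged sketch of the proposed target set reduction: instead of emitting
# language-specific characters, each word is mapped to common phone-level
# labels through a pronunciation dictionary. The entries below are only the
# illustrative examples from Figure 4.2, not the full HingCoS lexicon.
lexicon = {
    "bharat":   ["bh", "aa", "r", "a", "t"],
    "भारत":     ["bh", "aa", "r", "a", "t"],
    "english":  ["i", "ng", "l", "ii", "sh"],
    "अंग्रेज़ी":   ["a", "ng", "r", "ei", "j", "ii"],
}

def char_targets(words):
    """Unified character set tokenization: one target per character."""
    return [c for w in words for c in w]

def phone_targets(words, lexicon):
    """Proposed scheme: one target per common phone, via the pronunciation lexicon."""
    return [p for w in words for p in lexicon[w]]

words = list(lexicon)
print(len(set(char_targets(words))))            # number of unique character targets
print(len(set(phone_targets(words, lexicon))))  # 12 unique phone targets
```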
Figure 4.3: Creation of character-level LID tags for the training data towards condition-
ing the E2E networks to perform LID task on code-switching speech. The ‘H/E’ denotes
Hindi/English LID tag. The ‘b/e’ label is appended to the ‘H/E’ LID tag to mark the be-
gin/end characters.
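A small Python sketch of this tagging scheme is given below (an assumption-laden illustration rather than the actual data-preparation script; in particular, deciding a word's language from its Unicode block is our simplification).

```python
# A hedged sketch of the character-level LID tagging described in Figure 4.3:
# every character of a word receives the LID tag of its language ('H' or 'E'),
# with 'b'/'e' appended to the tags of the word's first and last characters.
# The language test below (Devanagari block => Hindi) is an assumption made
# for illustration.
def word_language(word):
    return "H" if any("\u0900" <= ch <= "\u097F" for ch in word) else "E"

def lid_tags(sentence):
    """Map a code-switched sentence to a character-level LID tag sequence."""
    tags = []
    for word in sentence.split():
        lang = word_language(word)
        for i, _ in enumerate(word):
            if i == 0:
                tags.append(lang + "b")          # begin character
            elif i == len(word) - 1:
                tags.append(lang + "e")          # end character
            else:
                tags.append(lang)
    return tags

print(lid_tags("मेरा current account balance"))
# ['Hb', 'H', 'H', 'He', 'Eb', 'E', ..., 'Ee', ...]
```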
Chapter 5
Experimental setup
5.1 Database
In this work, we explore the working of E2E ASR systems for two distinct tasks: monolingual ASR and code-switching ASR. The databases used for each of these tasks are described below.
5.1.1 Monolingual database - TIMIT

The monolingual experiments in this work are performed on a small standard speech corpus, TIMIT. The TIMIT dataset includes phonetic and lexical transcriptions of speech from American English speakers of both genders, hailing from eight different dialect regions. The dataset consists of a total of 6300 utterances spoken by 630 speakers, with 10 sentences per speaker. The label set contains a total of 48 phonemes including silence. The dataset is divided into train, development and test sets as proposed by the creators of the dataset, with each speaker contributing 8 utterances.
Figure 5.1: Sample code-switching sentences in the HingCoS corpus and their corresponding English translations.

Hindi-English code-switching sentences:
- क्या आप मुझे Deccan queen का departure time बता सकते हैं
- Functions के नाम भी lowercase letter से ही शुरू होते हैं
- कृपया मुझे मेरा current account balance बताएँ

English translations:
- Can you tell me the departure time of Deccan queen.
- The names of the functions also start with a lowercase letter.
- Please tell me my current account balance.
5.1.2 Code-switching database - HingCoS

The code-switching experiments in this work are performed using a Hindi-English code-switching speech corpus, referred to as the HingCoS corpus. A detailed description of the same is available in [29]. A few example sentences along with their English translations are shown in Figure 5.1.
It consists of 101 speakers, each of whom was asked to speak 100 unique code-switching sentences. The length of those sentences varies from 3 to 60 words.
All speech data is recorded at 8 kHz sampling rate and 16-bits/sample resolution. The database
contains 9251 Hindi-English code-switching utterances which correspond to about 25 hours of
speech data. For ASR system modelling, the database is partitioned into train, development
and test sets containing 7015, 1152 and 2136 sentences, respectively. To study the effect of
utterance-length in decoding, three partitions of test set are created on the basis of length of
utterances. Those partitions correspond to utterance-length ranges as 3-15, 16-25, and 26-60
words and are referred to as Test1, Test2, and Test3, respectively. The resulting Test1, Test2, and Test3 sets consist of 957, 719, and 460 utterances, respectively. The unified character set modelling case comprises 95 targets (26 English characters, 68 Hindi characters, and a word separator). In contrast, the proposed scheme reduces that to 63 targets (62 common phones and
a word separator). In this work, we contrast the performances of the proposed reduced target
set based E2E ASR systems with those of unified character set based ones.
5.2 Evaluation metric

For the ASR task, the developed systems are evaluated in terms of the phone (or character) error rate, computed as
$$\mathrm{PER} = \frac{S_p + D_p + I_p}{N_p} \times 100$$
where $S_p$, $D_p$, and $I_p$ denote the numbers of substitutions, deletions, and insertions, respectively, and $N_p$ denotes the total number of labels in the reference.
For LID task, the developed E2E systems are evaluated in terms of the LID error rate computed
as
$$\text{LID error rate} = \frac{N_S + N_I + N_D}{N} \times 100$$
where, the numerator terms NS , NI , and ND refer to the number of substitutions, insertions, and
deletions, respectively. The denominator N refers to the total number of labels in the reference.
For this evaluation, the reference transcriptions for all test utterances labeled in terms of the
proposed LID tags are aligned with the corresponding outputs produced by the E2E network.
In addition to this character-level LID error rate, a corresponding word-level LID error rate is
also computed in a similar fashion by applying majority voting scheme [30] on the character-
level LID labels.
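The following sketch (not the evaluation scripts used in the thesis) illustrates both steps: an edit-distance based error rate over label sequences, and the majority vote that maps the character-level LID tags of one word to a single word-level label.

```python
# An illustrative sketch of the two measures above: an edit-distance based
# error rate over label sequences, and the majority-voting step that maps the
# character-level LID labels of a word to a single word-level LID label.
from collections import Counter

def error_rate(reference, hypothesis):
    """(S + D + I) / N * 100 via dynamic-programming edit distance."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[n][m] / len(reference)

def word_level_lid(char_tags):
    """Majority vote over the character-level tags ('H...'/'E...') of one word."""
    votes = Counter(tag[0] for tag in char_tags)
    return votes.most_common(1)[0][0]

print(error_rate("Hb H He Eb Ee".split(), "Hb E He Eb Ee".split()))  # 20.0
print(word_level_lid(["Hb", "E", "He"]))                             # 'H'
```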
5.3 System description

5.3.1 Monolingual experiments

The raw audio samples are preprocessed to extract filterbank energies as features. The hyperparameters used for training the E2E LAS architecture on the TIMIT database are summarized in Table 5.1. Experiments on the well-known TIMIT database have been performed to benchmark our implementation of the E2E ASR systems.
5.3.2 Code-switching experiments

The E2E models developed in this work are trained using the Nabu toolkit [25], which is based on TensorFlow.
Table 5.1: Hyperparameters used for training the E2E LAS architecture on the TIMIT database.

Encoder: 2 + 1 layers, 128 units per layer in each direction, dropout 0.5
Decoder: 2 layers, 128 units per layer, beam width 16, dropout 0.5
Training: initial learning rate 0.2, exponential decay 0.1, batch size 32
For contrast, the DNN-HMM systems have also been trained and evaluated using the Kaldi toolkit [31]. The parameter settings used for analyzing the speech data include a window length of 25 ms, a window shift of 10 ms, and a pre-emphasis factor of 0.97. The 26-dimensional features comprising log filter-bank energies are used for developing the E2E systems.
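A hedged sketch of this front-end is shown below, using the python_speech_features package as a stand-in for the Nabu/Kaldi feature pipelines actually used in the experiments.

```python
# A hedged sketch of the front-end described above (an assumption: the thesis
# relies on the Nabu/Kaldi pipelines rather than this exact code). It computes
# 26 log filter-bank energies with a 25 ms window, 10 ms shift and 0.97
# pre-emphasis.
import numpy as np
from python_speech_features import logfbank

sample_rate = 8000                         # HingCoS audio is sampled at 8 kHz
signal = np.random.default_rng(0).normal(size=sample_rate * 2)  # stand-in for 2 s of speech

features = logfbank(signal,
                    samplerate=sample_rate,
                    winlen=0.025,          # 25 ms window
                    winstep=0.01,          # 10 ms shift
                    nfilt=26,              # 26 log filter-bank energies
                    preemph=0.97)          # pre-emphasis factor
print(features.shape)                      # (number_of_frames, 26)
```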
systems. It is to be noted that, the E2E systems are optimized for the reduced target set and the
same parameters have been used for the unified character set systems. The remaining details of
the above mentioned systems are presented next.
Attention-based E2E model

The architectural details of the LAS model are as follows. The encoder has 3 pyramidal
DBLSTM layers with 512 units in each layer. The pyramidal step size is kept as 2 and the
dropout rate in training is set to 0.5. The LSTM decoder consists of 2 layers with 512 units in
each layer. The dropout rate for the LSTM decoder is also set to 0.5. Loss function used for
training is the average cross-entropy loss and Gaussian noise with σ = 0.6 is added to the data
while training. We have employed the beam-search decoder with beam width set as 16. The
model is trained for 400 epochs. Batch size used is 32 with learning rate decay fixed at 0.1.
CTC-based E2E model
This modelling paradigm involves a DBLSTM network as the encoder which consists of 4
layers and 256 units in each layer with dropout rate set to 0.5. The decoder utilizes CTC loss
function as discussed in Section 3.1. Gaussian noise with σ = 0.6 is added to the speech data for
modelling robustness. In model training, the number of epochs is set as 250 and the mini-batch
size is set to 32.
DNN-HMM model
The DNN-HMM acoustic model contains 5 hidden layers and 1024 nodes in each layer. The
hidden nodes use tanh as the non-linearity. First, 13-dimensional MFCC features are spliced
across ±3 frames to produce 91-dimensional feature vectors, which are then projected to 40
dimensions by applying linear discriminant analysis. These 40-dimensional feature vectors are
used for training the DNN-HMM acoustic model. The model is run for 20 epochs with a batch
size of 128.
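The feature preparation can be sketched as follows (shapes only; the projection matrix below is a random stand-in for the LDA transform that Kaldi estimates from aligned training data).

```python
# A small sketch (assumed shapes, not the Kaldi recipe itself) of the feature
# preparation described above: 13-dimensional MFCC frames are spliced across
# +/-3 neighbouring frames to give 91-dimensional vectors, then projected to
# 40 dimensions.
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(500, 13))          # 500 frames of 13-dim MFCCs

def splice(feats, context=3):
    """Concatenate each frame with its +/-context neighbours (edges are padded)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

spliced = splice(mfcc)                     # (500, 91)
lda = rng.normal(size=(91, 40))            # stand-in for the learned LDA matrix
projected = spliced @ lda                  # (500, 40) inputs to the DNN-HMM model
print(spliced.shape, projected.shape)
```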
5.3.3 Experimental setup for LID

In this subsection, we describe the tuning of the parameters of both the developed LID systems, which is done on the development set defined earlier. In this work, the attention-based E2E LID system is trained by employing the LAS network, in which the encoder (listener) has 2 hidden layers, each with 128 BLSTM nodes. The dropout rate of the encoder is set to 0.5. The number of hidden layers and nodes of the decoder (speller) are kept the same as those of the encoder, except
that the nodes are simple LSTMs. The LAS network is trained by setting the number of epochs
as 300, batch size as 32, and the learning rate decay as 0.1. During decoding, the beam width
is set as 8. For contrast purpose, a CTC-based E2E LID model is trained with 3 hidden layers,
each having 128 BLSTM nodes. The dropout rate of the encoder is set to be similar to that of
the LAS model. The network is trained with number of epochs as 300. Also, the parameters
corresponding to the batch size, and the learning rate decay are set as 8 and 0.1, respectively.
During decoding, the CTC cost function is employed to produce 1-best output sequence.
Chapter 6
In this chapter, we first discuss the results for the E2E ASR task (monolingual ASR, code-switching ASR, and the results of the proposed target set reduction scheme), followed by the results obtained for the proposed joint LID task.
6.1 Results on TIMIT

Table 6.1: Comparison of previously reported phone error rates with our implementation on the core test set of the TIMIT database.
6.2 Results on HingCoS
In the ASR task, for the unified target set case, the performances are measured in terms of the character error rate (CER), whereas for the reduced target set case we have used the phone error rate (PER) as the measure. For proper evaluation, both attention and CTC based E2E ASR systems are developed using the reduced and unified target sets, and their performances are reported in Table 6.2. It can be observed that with the proposed reduction in the target set, all the explored E2E systems yield significantly improved recognition performance (i.e., target error rate) over the corresponding unified target set based systems. Interestingly, this trend carries over all three test sets defined earlier. On comparing the reduced target set systems, we note that the attention-based E2E ASR system outperforms the CTC-based one, whereas the CTC-based E2E system has yielded slightly better CER for the unified target set modelling case.
| System | Test1 (PER / CER) | Test2 (PER / CER) | Test3 (PER / CER) | Average (PER / CER) |
| Attention-based E2E | 21.01 / 33.69 | 21.06 / 34.80 | 23.70 / 39.38 | 21.92 / 35.96 |
| CTC-based E2E | 32.91 / 35.82 | 28.89 / 32.85 | 28.33 / 33.87 | 30.04 / 34.18 |

Table 6.2: Evaluation of attention and CTC based E2E systems developed using both reduced and unified target sets on Hindi-English code-switching data. The performances of the reduced and unified target set based systems are measured using the phone error rate (PER) and character error rate (CER), respectively. The performances of the DNN-HMM system on those tasks are also given for reference purposes.
It is worth emphasizing that with further reduction in the target set, further improvement in PERs could be achieved relative to CERs. But any such reduction would be counterproductive if we cannot derive accurate word sequences from the output hypotheses expressed in terms of those reduced target labels. That criterion is very much satisfied by the proposed phone-based reduction of the target set in the case of code-switching speech. On the other hand, for unified target set based E2E ASR systems, the decoded outputs may comprise cross-language character insertions due to acoustic similarity. To illustrate this, we show a few example decoded sequences for both reduced and unified target set based E2E systems in Table 6.3. From that table, we can note that the decoded sequence for the attention-based E2E system exhibits better recognition of the target labels as well as better word boundary marking in comparison to that of the CTC-based system. This trend is attributed to the ability of the attention-based E2E network to utilize all the previously decoded labels along with the current input while making decisions.

Target set: Reduced
Oracle sequence: a g a r _ aa p k o _ y a h _ p o s tx _ p a s a q d _ aa y aa _ h o _ t o _ aa p _ i s ee _ l ai k _ j a r uu r _ k a r ee
Table 6.3: Sample decoded outputs for E2E code-switching ASR systems developed using
reduced and unified target sets. The errors have been highlighted in bold. Note that, the symbol
‘ ’ is used to mark separation between the words.
Oracle m ee r ee _ b l ao g _ k aa _ tx ao p i k _ k y aa _ h o n aa _ c aa h i ee
CTC based m ee r ee _ b l ao g _ p a _ tx ao p i k _ k y aa _ c aa h i ee
Oracle म ◌े र ◌े _ b l o g _ क ◌ा _ t o p i c _ क ◌् य ◌ा _ ह ◌ो न ◌ा _ च ◌ा ह ि◌ ए
CTC based म ◌े र ◌े _ b l o g _ क ◌ा _ t o f i c _ क ◌् य ◌ा _ ह ◌ो न ◌ा _ च ◌ा ह ि◌ ए
Oracle i s k ee _ k o n _ s ee _ ei s ee _ f ii c er s _ h ei
CTC based i s k ee _ k au n _ s ee _ ei s ee _ c u _ h ei
Oracle इ स क ◌े _ क ◌ौ न _ स ◌े _ ऐ स ◌े _ f e a t u r e s _ ह ◌ै
CTC based इ स क ◌े _ क o म _ स ◌े _ ऐ स ◌े _ f e a t u r e _ ह ◌ै
Oracle y a h aa mq _ p a r _ aa p k o _ tx uu _ ao p sh a n _ m i l ee g aa
CTC based y a h aa mq _ p a r _ aa p k o _ tx o _ ao p n _ m i l ee g aa
Oracle य ह ◌ा ◌ँ _ प र _ आ प क ◌ो _ t w o _ o p t i o n _ म ि◌ ल ◌े ग ◌ा
CTC based य ह ◌ा ◌ँ _ प र _ आ प क ◌ो _ t w o _ o p ◌् t o n _ म ि◌ ल ◌े ग ◌ा
Table 6.4: Sample decoded outputs for E2E code-switching ASR systems developed using
reduced and unified target set for comparison across models.
Table 6.5: Evaluation of the developed E2E LID systems on the Hindi-English code-switching task. The LID error rates have been computed at both the character and word levels. The total number of characters/words (N) in the reference transcription is 198,855 / 41,025.
In this work, two different kinds of E2E joint LID systems are developed and evaluated on the
HingCoS corpus. The LID error rates computed both at character and word levels for these
systems are reported in Table 6.5. In contrast to CTC, the use of the LAS architecture in the E2E LID system is noted to yield a substantial reduction in the error rates. This is attributed to the ability of the attention mechanism in the LAS network to accurately predict the language switching in the data. To highlight that, we have computed the language-specific averaged attention weights
with respect to the decoded LID label sequence and the plot for the same is shown in Figure 6.1.
The description of each of the subplots in Figure 6.1 is presented next.
Figure 6.1(a) shows the spectrogram of a typical Hindi-English speech utterance in the
test set. Note that the spectrogram is manually labeled with the spoken words and their boundaries for reference purposes. The variations of the averaged attention weights for Hindi and
English language targets present in the input speech data with respect to time, are shown in
Figure 6.1(b). The sequence alignment produced by the attention network for the input speech
data (on the x-axis) and the decoded output LID labels (on the y-axis) is plotted in Figure 6.1(c).
From Figures 6.1(b) and 6.1(c), we observe that the attention weights for Hindi and English
languages mostly peak around the corresponding word locations.
It is worth highlighting here that both CTC-based and attention-based E2E systems are
provided with identical target-level supervision during training. Unlike the attention-based system, the CTC-based system could not fully exploit that supervision. This is attributed to the fact that CTC assumes the outputs at different time steps to be conditionally independent, which makes it less capable of learning the label sequence structure.
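To make the conditional-independence argument concrete, the standard formulations of the two objectives can be contrasted as follows, with $\mathbf{x}$ the input feature sequence of length $T$, $\mathbf{y} = (y_1, \ldots, y_U)$ the output label sequence, $\pi$ a frame-level CTC alignment path, and $\mathcal{B}$ the CTC collapsing function:
\[
P_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} P(\pi_t \mid \mathbf{x}),
\qquad
P_{\mathrm{att}}(\mathbf{y} \mid \mathbf{x}) = \prod_{u=1}^{U} P(y_u \mid y_1, \ldots, y_{u-1}, \mathbf{x}).
\]
Only the attention-based factorization conditions each output label on the previously emitted labels, which is precisely the label-level context that CTC, by construction, does not model.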
Labeled word sequence of the utterance shown in Figure 6.1(a): sil | इस | case | में | cash | को | select | कीजिए | sil
Figure 6.1: Visualization of the attention mechanism for the LID task. For a given Hindi-English code-switching utterance: (a) spectrogram labeled with Hindi and English word boundaries for reference, (b) variation of the attention weights with respect to time for the Hindi and English languages, and (c) alignment produced by the attention network between the input speech and the decoded output LID labels.
Figure 6.2: Visualization of the attention mechanism for the LID task on a second example utterance; the panels (a)–(c) follow the same layout as in Figure 6.1.
To support this argument, for the very utterance considered in Figure 6.1, the character-level decoded outputs of the CTC- and attention-based E2E LID systems are listed in Table 6.6. The word-level LID labels for both the considered systems are also shown in that table. On comparing the hypothesized sequences of output labels, it can be noted that the inclusion of the attention mechanism in the E2E LID system leads to more effective language identification within code-switching speech data.
Character-level LID labels
Reference sequence: Hb He | Eb E Ee | Hb H He | Eb E Ee | Hb He | Eb E E E E Ee | Hb H H H He
CTC-based hypothesis: Hb E Ee | Eb E Ee | Eb He | Hb He | Eb Ee | Eb Ee | Hb He
Attention-based hypothesis: Hb He | Eb E E E Ee | Hb H He | Eb E E Ee | Hb He | Eb E E E E Ee | Hb H H H He
Table 6.6: The character and word level decoded outputs for CTC- and attention-based E2E
LID systems for the utterance considered in Figure 6.1. A majority voting scheme is employed
for mapping the character-level LID label sequences to word-level LID label sequences. The
attention-based system is able to decode the LID label sequences more accurately when com-
pared to the CTC-based system.
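The majority-voting step mentioned in the caption can be sketched as follows. The input format, with ‘|’ separating words and labels such as ‘Hb’, ‘H’, ‘He’, ‘Eb’, ‘E’, ‘Ee’, follows our reading of Table 6.6, and the helper name is illustrative rather than taken from the actual code.

```python
from collections import Counter

# Minimal sketch of the majority-voting mapping from character-level LID labels
# to word-level LID labels. The input format ('|' separating words; labels such
# as 'Hb', 'H', 'He', 'Eb', 'E', 'Ee') follows our reading of Table 6.6; ties
# are broken here by first occurrence, which the actual system may handle
# differently.

def char_to_word_lid(char_label_string):
    word_labels = []
    for word in char_label_string.split("|"):
        langs = [lab[0] for lab in word.split()]   # 'Hb' -> 'H', 'Ee' -> 'E'
        if langs:
            word_labels.append(Counter(langs).most_common(1)[0][0])
    return word_labels

# CTC-based hypothesis from Table 6.6
hyp = "Hb E Ee | Eb E Ee | Eb He | Hb He | Eb Ee | Eb Ee | Hb He"
print(char_to_word_lid(hyp))   # one word-level LID label ('H' or 'E') per word
```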
Chapter 7
Conclusion and future work
In this work, we have built E2E ASR systems, tried to understand their working, and compared their performance with that of conventional DNN-HMM systems. Further, to extend the prowess of these E2E systems to tasks like code-switching ASR, where a large amount of training data is not available, we present a novel target label reduction scheme for training the E2E code-switching ASR systems. The systems employing the reduced targets are shown to outperform the unified-target-based systems. It has been demonstrated that the attention-based E2E system trained with the reduced target set achieves the best averaged target (phone) error rate. Further, since code-switching data comprises two or more languages, to build better code-switching ASR systems we also propose joint E2E LID systems employing the CTC and attention mechanisms for identifying the languages present in code-switching speech. The development and evaluation of the proposed systems are done on the Hindi-English code-switching speech corpus. Towards developing the LID systems, a novel target labeling scheme has been introduced, which is found to be very effective for the attention-based system. On comparing the attention and CTC mechanisms, the former is noted to achieve a two-fold reduction in both the character- and word-level LID error rates. The work also demonstrates the ability of the attention mechanism to detect the language boundaries in code-switching speech data. Although the experiments have been performed on Hindi-English code-switching data, the proposed approach can easily be extended to other code-switching contexts.
Our aim now is to build a joint LID-ASR system for E2E ASR of code-switching speech under the paradigm of multitask learning, as described in Section 1.5. This is motivated by a recent work [36], in which the authors reported improvement in Mandarin-English code-switching ASR by employing multitask learning with the LID labels. Further, we aim to explore the attention mechanism by employing a supervised attention framework (with supervision provided by the LID task) for building more effective attention-based E2E code-switching ASR systems. Work in this direction has already started, and we aim to complete it within the next few months.
Chapter 8
List of publications
1. Kunal Dhawan, Ganji Sreeram, Kumar Priyadarshi and Rohit Sinha, “Investigating
Target Set Reduction for End-to-End Speech Recognition of Hindi-English Code-Switching
Data”, submitted to Interspeech, 2019.
2. Ganji Sreeram, Kunal Dhawan, Kumar Priyadarshi and Rohit Sinha, “Joint Language
Identification of Code-Switching Speech using Attention based E2E Network”, submitted
to Interspeech, 2019.
Other publications
1. Sreeram Ganji, Kunal Dhawan and Rohit Sinha, “IITG-HingCoS Corpus: A Hinglish
Code-Switching Database for Automatic Speech Recognition”, Speech Communication, 2019. DOI: https://doi.org/10.1016/j.specom.2019.04.007
Bibliography
[4] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. of Interna-
tional Conference on Machine Learning: Representation Learning Workshop, 2012.
[5] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural
networks,” in Proc. of International Conference on Machine Learning, 2014, pp. 1764–
1772.
[6] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recog-
nition using attention-based recurrent NN: First results,” in Proc. of Deep Learning and
Representation Learning Workshop, 2014.
[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” in Proc. of International Conference on Learning Representations,
2015.
[9] T. N. Sainath et al., “No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models,” CoRR, vol. abs/1712.01864, 2017.
[14] K. Bhuvanagirir and S. K. Kopparapu, “Mixed language speech recognition without ex-
plicit identification of language,” American Journal of Signal Processing, vol. 2, no. 5, pp.
92–97, 2012.
[15] B. H. Ahmed and T.-P. Tan, “Automatic speech recognition of code switching speech using
1-best rescoring,” in Proc. of International Conference on Asian Language Processing
(IALP), 2012, pp. 137–140.
[17] S. Malhotra, “Hindi-English, code switching and language choice in urban, uppermiddle-
class Indian families,” University of Kansas. Linguistics Graduate Student Association,
1980.
[20] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neu-
ral networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, May 2013, pp. 6645–6649.
[21] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for
large vocabulary conversational speech recognition,” in 2016 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
[22] W. Geng et al., “End-to-end language identification using attention-based recurrent neural
networks,” in Proc. of Interspeech, 2016.
[26] B. Ramani et al., “A common attribute based unified HTS framework for speech synthesis
in Indian languages,” in Proc. of 8th ISCA Workshop on Speech Synthesis, 2013.
[27] K. A. H. Zirker, “Intrasentential vs. intersentential code switching in early and late bilin-
guals,” 2007.
[28] W. Chan et al., “Listen, attend and spell: A neural network for large vocabulary conversa-
tional speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech
and Signal Processing, pp. 4960–4964.
[30] B. Parhami, “Voting algorithms,” IEEE Transactions on Reliability, vol. 43, no. 4, pp.
617–629, 1994.
[31] D. Povey et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society,
2011.
[32] F. Sha and L. K. Saul, “Large margin Gaussian mixture modeling for phonetic classification and recognition,” in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006, pp. I–I.
[33] A. Graves, N. Jaitly, and A. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Dec 2013, pp. 273–278.
[34] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in Proc. of Interspeech, 2013.
[35] J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, pp. I–409.
[36] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, “Towards end-to-end code-
switching speech recognition,” arXiv preprint arXiv:1810.13091, 2018.