BTP Thesis: End-to-End ASR
CERTIFICATE

This is to certify that this thesis is the work of the candidates, carried out for the award of the degree of Bachelor of Technology in the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under my supervision, and that it has not been submitted elsewhere for a degree.

Guide
Date:
Place:
DECLARATION
The work contained in this thesis is our own work under the supervision of the
guides. We have read and understood the “B. Tech./B. Des. Ordinances and
Regulations” of IIT Guwahati and the “FAQ Document on Academic Malprac-
tice and Plagiarism” of the EEE Department of IIT Guwahati. To the best of our
knowledge, this thesis is an honest representation of our work.
Author
Date:
Place:
Acknowledgments
First and foremost, we would like to express our heartfelt gratitude towards Prof. Rohit Sinha for always being there to guide us irrespective of his busy schedule. We learnt a lot from him and will always be grateful to him for giving us the opportunity to work on this ambitious project.
Further, we would like to sincerely thank Mr. Ganji Sreeram for his constant support and help
throughout the project. This thesis would not have been possible without his valuable inputs
and debugging discussions.
Abstract
The goal of an automatic speech recognition system (ASR) is to accurately and efficiently con-
vert a given speech signal into its corresponding text transcription of the spoken words, irre-
spective of the recording device, the speaker’s accent, or the acoustic environment. To achieve
this, several models such as dynamic time warping, hidden Markov models and deep neural net-
works have been proposed over time in literature. These conventional models give rise to very
complicated systems consisting of various sub-modules. An end-to-end (E2E) automatic speech
recognition system greatly simplifies this pipeline by replacing these complicated sub-modules
with a deep neural network architecture employing data-driven linguistic learning methods. As
the target labels are learned directly from speech data, the E2E systems need a bigger corpus
for effective training. In this work, we try to understand the working of these E2E ASR systems and extend their prowess to tasks where a comparatively smaller amount of training data is available. To work on such a problem, the task of code-switching speech recognition is chosen.
Code-switching refers to the phenomenon of switching between two or more languages while
speaking in multilingual communities. In this work, we aim to address two important problems in the code-switching domain using an end-to-end approach. The first problem is the automatic speech recognition of code-switching data, which cannot be directly tackled using conventional approaches due to the multiple languages involved. The second task is the word-level automatic
language identification (LID) in the context of intra-sentential code-switching. All experimen-
tal validations have been performed on the recently created IITG HingCoS (Hindi-English code-switching) corpus. Results for a conventional deep neural network-hidden Markov model based system have also been reported for contrast.
Keywords: end-to-end speech recognition, language identification, code-switching, at-
tention mechanism
Contents
Abstract
Nomenclature
1 Introduction
1.1 Conventional ASR systems
1.1.1 Hidden Markov models
1.1.2 Deep neural network - hidden Markov model
1.1.3 Limitations
1.2 End-to-end ASR systems
1.3 Code switching
1.4 Language identification
1.5 Multi-task learning
1.6 Contribution of this thesis
2 Literature survey
3 Theoretical background
3.1 Connectionist temporal classification
3.2 Sequence to sequence attention mechanism
3.3 Listen, attend and spell
3.3.1 Overview
3.3.2 Encoder (listener)
3.3.3 Decoder (speller)
3.3.4 Learning
3.3.5 Decoding
3.4 Implementation
5 Experimental setup
5.1 Database
5.1.1 Monolingual database - TIMIT
5.1.2 Code-switching database - HingCoS
5.2 Evaluation metric
5.3 System description
5.3.1 Monolingual experiments
5.3.2 Code-switching experiments
5.3.3 Experimental setup for LID
8 List of publications
List of Figures
4.1 The top two rows show the default Hindi and English character sets, respectively. The proposed reduced target labels covering both Hindi and English sets are shown in the bottom row.
4.2 Sample examples of the proposed common phone-level labelling and the existing character-level labelling schemes for E2E ASR system training.
4.3 Creation of character-level LID tags for the training data towards conditioning the E2E networks to perform the LID task on code-switching speech.
5.1 Sample code-switching sentences in the HingCoS corpus and their corresponding English translations.
6.1 Visualization of the attention mechanism for the LID task on a Hindi-English code-switching utterance.
6.2 Visualization of the attention mechanism for the LID task - second example.
Nomenclature
E2E End-to-End
Chapter 1
Introduction
Automatic speech recognition aims at enabling devices to correctly and efficiently identify spo-
ken language and convert it into text. Some of the most important applications of speech recog-
nition include speech-to-text processing, audio information retrieval, keyword search and gen-
erating streaming captions in videos. From a technology viewpoint, speech recognition has
evolved with several waves of novel innovations. Traditional general-purpose speech recog-
nition systems are based on hidden Markov models. These, when combined with deep neural networks, led to major improvements in various components of the speech recognition pipeline.
1.1 Conventional ASR systems

1.1.1 Hidden Markov models

The hidden Markov model (HMM) is an important framework for the construction of speech recognition systems because speech has temporal structure and can be efficiently encoded as a sequence of spectral vectors spanning the audio frequency range. A conventional ASR pipeline consists of a feature extraction module (typically Mel-frequency cepstral coefficients are used as features), a hidden Markov model and Gaussian mixture model (HMM-GMM) based acoustic model, which captures how a word is pronounced, and an n-gram language model, which helps ascertain which sequences of words can possibly occur in a given language. This pipeline is graphically depicted in Figure 1.1.
Figure 1.1: Block diagram of the conventional ASR pipeline: feature extraction produces feature vectors that the decoder combines with the acoustic model, pronunciation dictionary, and language model to output words.
1.1.2 Deep neural network - hidden Markov model

The HMM has traditionally been combined with the Gaussian mixture model (GMM) for acoustic modelling in speech recognition, where the HMM models the probability of going from one acoustic state (typically a triphone) to another and the GMM models the probability of occurrence of that particular acoustic state. However, it has been experimentally observed that using deep neural networks instead of GMMs to model the (emission) probability of an acoustic state leads to better results; hence, deep neural network-hidden Markov model (DNN-HMM) systems are now widely used for acoustic modelling in speech recognition.
A DNN is a feedforward artificial neural network with several hidden layers, which enables it to learn complex representations. Each hidden unit in a layer uses a non-linear function such as tanh to map the features from the previous layer to the next layer. The outputs of the DNN are then fed to the HMM.
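To make the hybrid setup concrete, the following is a minimal NumPy sketch (not the thesis code; layer sizes and the number of HMM states are illustrative assumptions) of such a DNN: tanh hidden layers map an acoustic feature vector to softmax posteriors over HMM states, which a hybrid system would then typically convert to scaled likelihoods for the HMM by dividing by the state priors.

```python
# A minimal NumPy sketch (not the thesis code) of the DNN used in a hybrid
# DNN-HMM system: tanh hidden layers mapping an acoustic feature vector to
# posterior probabilities over HMM states (senones). Layer sizes are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def init_dnn(feat_dim=40, hidden=(1024, 1024, 1024), n_states=2000):
    """Randomly initialised weights; in practice these are learned by backprop."""
    sizes = (feat_dim,) + hidden + (n_states,)
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def dnn_posteriors(x, params):
    """Forward pass: tanh hidden layers, softmax output P(state | frame)."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return softmax(h @ W + b)

params = init_dnn()
frame = rng.normal(size=40)            # one (spliced) feature vector
post = dnn_posteriors(frame, params)   # posteriors over HMM states
print(post.shape, post.sum())          # (2000,) 1.0
```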
Let us now perform a mathematical analysis of an ASR system and understand the impor-
tance of each of the stages as discussed in the pipeline. Automatic speech recognition deals with
mapping of speech from a $T$-length input feature sequence, $X = (\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T)$, to an $N$-length word sequence, $W = (w_n \in \mathcal{V} \mid n = 1, \dots, N)$. Here, $\mathbf{x}_t$ is a $D$-dimensional speech feature vector at frame $t$, and $w_n$ is a word in the vocabulary $\mathcal{V}$. We mathematically formulate
the ASR problem using Bayes decision theory; the task thus reduces to the estimation of the most probable word sequence $\hat{W}$ among all possible word sequences $\mathcal{V}^*$:
$$\hat{W} = \underset{W \in \mathcal{V}^*}{\arg\max}\; p(W \mid X)$$
So, the ASR problem reduces to obtaining the posterior distribution $p(W \mid X)$. Assuming a hybrid DNN-HMM model, we use the HMM state sequence $S = (s_t \in \{1, \dots, J\} \mid t = 1, \dots, T)$, where $J$ is the total number of HMM states, to factorize $p(W \mid X)$ as follows:
$$\hat{W} = \underset{W \in \mathcal{V}^*}{\arg\max} \sum_{S} p(X \mid S, W)\, p(S \mid W)\, p(W) \tag{1.1}$$
The three distributions $p(X \mid S)$, $p(S \mid W)$ and $p(W)$ represent the acoustic, lexicon and language models of the conventional ASR pipeline, respectively.
1.1.3 Limitations
This conventional DNN-HMM based ASR system is a fairly complicated system consisting of various sub-modules dealing with separate acoustic, pronunciation and language models (as seen in the above derivation). Listed below are some factors that make this method sub-optimal with regard to speech recognition performance:
1. Many module-specific processes are required for efficient working of the final model.
2. Curating a pronunciation lexicon and defining phoneme sets for a particular language requires expert linguistic knowledge and is time-consuming.
4. All the different modules are optimized separately with different objectives, which re-
sults in a sub-optimal model as a whole as the individual modules are not optimized to
match the constraints of other sub-modules. In addition, the training objectives and final
evaluation metrics are very different from each other.
Figure 1.2: Block diagram of E2E networks using: a) CTC mechanism, and b) attention mechanism.
1.3 Code switching

It has been observed that people use words of a foreign language while conversing in their native tongue so as to effectively communicate with other people [11, 12].
The recent spread of urbanization and globalization have positively impacted the growth of
bilingual/multilingual communities and hence made this phenomenon more prominent. The
growth in such communities has made automatic recognition of code-switching speech an im-
portant area of interest [13–15]. In India, Hindi is the native language of around 50% of its
1.32 billion population [16]. A large portion of the remaining half, especially those residing
in metropolitan cities, understand the Hindi language well enough. Due to its prominence in administration, law and the corporate world, the English language is also used by around 125 million people in India. Thus, Indians naturally tend to use some English words within their Hindi discourse, which is referred to as Hindi-English code-switching [17, 18]. Despite the increasing code-switching phenomenon, the research activity in this area is somewhat limited due to the lack of resources, especially for building robust code-switching ASR systems. A few examples of different code-switching varieties are presented in Table 1.1. Note that the Type-1 and Type-2 categories
refer to high and low contextual information carried by the non-native language (here English),
respectively. Note that the experiments in this work have been carried out on intra-sentential
code-switching speech corpus.
Table 1.1: Sample Hinglish sentences showing the different varieties of code-switching data.

Hindi:
- कृपया मुझे मेरा चालू खाता शेष राशि बताएँ
- कार्यों के नाम भी छोटे अक्षर से ही शुरू होते हैं
- मेरा atm कार्ड खो गया है तो मैं अपने भुगतान को कैसे रोक सकता हूँ

English:
- can you tell me the departure time of deccan queen
- please tell me my current account balance
- the names of the functions also start with a lowercase letter
- my atm card is lost so how can I stop my payment

Inter-sentential:
- she is the daughter of ceo, वह यहाँ दो दिन के लिए आई है
- मुझे अमेरिका में चार साल हो गए, but I still miss my country

Intra-sentential (Type-1):
- मुझे मेरा current account balance जानना है
- भारत में popular free virtual credit card services कितनी हैं

Intra-sentential (Type-2):
- अपने budget के अनुसार investments कर सकते हैं
- class और object के बीच relationship क्या है
1.4 Language identification
The task of detecting the languages present in spoken or written data using machines is referred to as language identification (LID). It finds applications in many areas including the automatic recognition of code-switching speech. Traditionally, LID has been performed by employing separate large vocabulary continuous speech recognisers (LVCSRs) [19]. In this approach, separate LID systems are built independently for each of the languages present in the data, and the task of each system is to predict which of the words in a given sentence belong to that particular language. In this work, we aim to develop an LID system that can identify the code-switching instances directly instead of separately modelling the underlying languages, i.e. a joint LID system.
1.5 Multi-task learning
In conventional deep learning, neural networks are generally optimized for a single metric. In order to achieve this, we train a single network or an ensemble of networks to achieve our goal. This approach has a slight downside in that we ignore information that could help us perform better on the given task. Specifically, we can obtain this information by training on the signals of related tasks. By sharing representations between related tasks, we can enable our model to perform better on the original task. This paradigm is commonly referred to as multi-task learning (MTL). A neural network architecture illustrating the MTL paradigm is shown in Figure 1.3.
Figure 1.3: A multi-task learning network in which shared layers process the input and feed multiple task-specific outputs (Output A, Output B, Output C).
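The following is a small NumPy sketch of the idea in Figure 1.3 (illustrative only; the layer sizes, the two hypothetical tasks and the weighting factor are assumptions, not the thesis configuration): a shared layer feeds two task-specific heads, and a single weighted loss couples the tasks.

```python
# A small NumPy sketch (illustrative only, not the thesis code) of the
# multi-task idea in Figure 1.3: a shared representation feeds several
# task-specific output heads, and a single weighted loss is optimized.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 40))              # a batch of input features

W_shared = rng.normal(0, 0.1, (40, 64))   # shared layers (here just one)
W_task_a = rng.normal(0, 0.1, (64, 10))   # head for task A (e.g. ASR targets)
W_task_b = rng.normal(0, 0.1, (64, 3))    # head for task B (e.g. LID targets)

h = np.tanh(x @ W_shared)                 # shared representation
logits_a, logits_b = h @ W_task_a, h @ W_task_b

def xent(logits, labels):
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

labels_a = rng.integers(0, 10, size=8)
labels_b = rng.integers(0, 3, size=8)

lam = 0.5                                 # task-weighting hyperparameter (assumed)
loss = lam * xent(logits_a, labels_a) + (1 - lam) * xent(logits_b, labels_b)
print(loss)                               # both tasks shape the shared layer's gradient
```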
1.6 Contribution of this thesis

The major contributions of this work are:

1. Building and comparing the performances of two types of E2E ASR systems with conventional DNN-HMM systems on two distinct ASR tasks, namely monolingual ASR and code-switching ASR.
2. Proposal and experimental validation of a novel target set reduction scheme for E2E speech recognition of code-switching data.
3. Proposal and experimental validation of a novel joint modelling based LID system for code-switching speech.
We are pleased to report that the experimental findings of this thesis have been submitted as two
papers in Interspeech-2019. The details of the same have been presented in Chapter 8.
Chapter 2
Literature survey
The earliest technological innovation that fueled the rise of E2E systems is CTC. It was proposed by Graves et al. [3] as a way to train an acoustic model without requiring frame-level alignments, as is the case with conventional HMM-based models. Before the advent of CTC, speech recognition models based on recurrent neural networks (RNNs) required pre-aligned and segmented data for efficient training, in addition to extensive post-processing. CTC alleviated these limitations and allowed unsegmented speech data to be labelled directly. We look at CTC in more detail in Chapter 3.
In a new wave of development of E2E systems, Graves and Jaitly came up with a complete CTC-based E2E ASR system [5]. They proposed a system with character-based CTC which directly outputs word sequences given input speech, without requiring an intermediate phonetic representation. The architecture utilized deep BLSTM units and the CTC objective function for training. In addition, they proposed to modify the objective function by introducing a new loss function (namely the transcription loss) that directly optimizes the word
error rate even in the absence of a lexicon or language model. The proposed system obtained
a word error rate of 8.2 % on the popular WSJ corpus as opposed to the best result (8.7 %)
obtainable using conventional systems. This work successfully demonstrated that it is possi-
ble to directly produce character transcriptions from the speech data using RNNs with minimal
preprocessing and in absence of a phonetic representation.
As an extension of their previous work, Graves et al. [20] came up with a deep RNN
architecture to explore the effectiveness of learning deep representations of speech data which
had earlier proved to be useful for computer vision tasks. They proposed augmenting a CTC-
based model with a recurrent language model component with both the blocks being trained
jointly on the acoustic data. The architecture showed promising results for the TIMIT phoneme
recognition task and the advantages of deep RNNs in modelling speech became immediately
obvious, but the work did not get as much traction as CTC.
In addition to CTC based models, another major class of architecture prevalent in the
ASR community is the encoder-decoder based model. Attention-based encoder-decoder models
emerged first in the context of neural machine translation. The initial applications of attention-
based models to ASR are found in Chan et al. [21] and Chorowski et al. [6]. In these architectures, the encoder plays the role of the acoustic model from the conventional ASR pipeline, as it transforms the input speech into a higher-level representation; the attention module is analogous to the alignment model, as its role is to identify the encoded frames that are relevant to producing the current output; and
the decoder is equivalent to the pronunciation and language models as it operates autoregres-
sively by predicting each output as a function of the previous predictions. These architectures
are explained in detail in section 3.2. Listen, Attend and Spell [21] is a popular example of the
encoder-attention-decoder architecture and is hence discussed further in section 3.3.
In a more recent development, Watanabe et al. proposed a novel architecture that combines CTC and attention to learn better representations and speed up the convergence of attention under the paradigm of multi-task learning. They proposed including the CTC cost function as an auxiliary task, with attention as the main task, to learn a shared-encoder representation. The multi-task framework allows simultaneous optimization of both cost functions. The model significantly outperformed both the CTC and the attention models individually, while enabling faster convergence of attention. This paves a new way for future research in ASR and
enables adaptation of diverse architectures in a single unified model as we can define domain
dependent sub-tasks with appropriate loss functions under the bigger umbrella of an E2E system
and jointly optimize the main and sub-tasks.
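As a rough sketch of this framework, and using an interpolation weight $\lambda$ that is our own notation rather than a value taken from that work, the joint objective can be written as

$$\mathcal{L}_{\mathrm{MTL}} = \lambda\, \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{attention}}, \qquad 0 \le \lambda \le 1.$$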
Let us now look at the phenomenon of code switching. It has been observed that people
use words of a foreign language while conversing in their native tongue so as to effectively
communicate with other people [11, 12]. The recent spread of urbanization and globalization
have positively impacted the growth of bilingual/multilingual communities and hence made this
phenomenon more prominent. The growth in such communities has made automatic recogni-
tion of code-switching speech an important area of interest [13–15]. In India, Hindi is the native
language of around 50% of its 1.32 billion population [16]. A large portion of the remaining
half, especially those residing in metropolitan cities understand the Hindi language well enough.
Due to prominence in administration, law and corporate world, English language is also used
by around 125 million people in India. Thus, Indians naturally tend to use some English words
within their Hindi discourse, which is referred to as Hindi-English code-switching [17, 18]. Despite the increasing code-switching phenomenon, the research activity in this area is somewhat limited due to the lack of resources, especially for building robust code-switching ASR systems. E2E
systems are fast becoming the norm for ASR task. Unlike the conventional systems, the E2E
systems directly model the output labels given the acoustic features. This is usually achieved
by employing two techniques: (i) CTC [3, 4], and (ii) sequence to sequence modelling with at-
tention mechanism [5–8]. Conventionally, the E2E systems have been trained for characters as
output labels which simplifies the process of data preparation. In [9], it is shown that grapheme-
based E2E ASR systems slightly outperform phoneme-based systems when a large amount of
data (>12,500 Hrs) is used for training.
LID finds applications in many areas including automatic recognition of code-switching
speech. In [19], the authors developed an LID system for code-switching speech by employ-
ing separate large vocabulary continuous speech recognisers (LVCSRs). Recently, researchers
have explored E2E networks in many speech/text processing applications. The E2E networks
can be trained by employing two techniques: (i) CTC [3], and (ii) seq2seq with attention
mechanism [6]. Current literature amply demonstrates that the attention-based E2E systems
outperform the CTC-based E2E systems. Recently, an utterance-level LID system employing the attention mechanism has been explored [22]. In that work, a set of pre-trained language-category embeddings is used as a look-up table for producing the attentional vectors.
Chapter 3
Theoretical background
As mentioned earlier, the traditional methods for speech recognition require separate sub-modules which are independently trained and optimized. For example, the acoustic model takes acoustic features as input and predicts a set of phonemes as output. The pronunciation model maps each sequence of phonemes to the corresponding word in the dictionary. Finally, the language model assigns probabilities to word sequences and determines whether a sequence of words is probable or not. E2E systems aim at learning all these components jointly as a single system with the aid of
a sufficient amount of data [23]. CTC and sequence-to-sequence (seq2seq) attention are the two
main paradigms around which the entire E2E ASR revolves. These approaches have advantages
in terms of training and deployment in addition to the modeling advantages which we describe
in this section.
Figure 3.1: Recurrent neural network architecture (image adapted from [1])
3.1 Connectionist temporal classification

Recurrent neural networks (RNNs) have been widely used in language modeling, natural language processing and machine translation tasks. In the
area of speech recognition, RNNs are typically trained as frame classifiers, which then require a
separate training target for every frame. HMMs have been traditionally used to obtain an align-
ment between the input speech and its transcription. CTC is an idea that makes the training for
sequence transcription tasks possible for an RNN while removing the need of prior alignment
between input and output sequences. While using CTC, the output layer of the RNN contains one unit for each of the transcription labels (for example, phonemes), in addition to one extra unit referred to as the ‘blank’, which corresponds to the emission of ‘nothing’, i.e. a null emission.
For a given training speech example, there are as many possible alignments as there are ways
of separating the labels with blanks. For example, if we use φ to denote a null emission, the
alignments (φ c φ a t) and (c φ a a a φ φ t t) correspond to the same transcription ‘cat’. While
decoding, we remove the labels that repeat in successive time-steps. At every time-step, the
network decides whether to emit a symbol or not. As a result, we obtain a distribution over all
possible alignments between the input and target sequences. Finally, CTC employs a dynamic
programming based forward-backward algorithm to obtain the sum over all possible alignments
and produces the probability of output sequence given a speech input. CTC is not very efficient
with respect to modelling sequences end-to-end, as it assumes that the outputs at various time-steps are conditionally independent of each other. As a result, it is incapable of learning a language model on its own. However, it enables the RNN to learn the acoustic and pronunciation models jointly and omits the HMM/GMM construction step.
The output layer in CTC contains a single unit for each of the transcription labels (characters /
phonemes) in addition to the blank. More formally, let the length of an input speech sequence
x be T and the output vectors yt be normalized using softmax. We interpret these as the proba-
bility of outputting a particular label with index k at a time t:
$$P(k, t \mid x) = \frac{\exp(y_t^k)}{\sum_{k'} \exp(y_t^{k'})}$$
As mentioned earlier, for a given transcription sequence, there are many possible align-
Figure 3.2: Architecture of the CTC-based E2E network. The encoder is a deep network con-
sisting of BLSTMs.
ments on account of ‘blank’ insertions. Also, if the same label appears on successive time steps
in an alignment, the repeats are removed. For example, (c c a a a t) will also correspond to
‘cat’. Let B denote an operator that removes the repeated labels and then, the blanks from
alignments. The total probability of an output transcription y is the sum of the probabilities of
the alignments that correspond to it. So,
$$P(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} P(a \mid x)$$
This summing over all possible alignments allows the architecture to be trained without requiring prior alignments. Given a target transcription $y^*$, the network is trained to minimize the CTC cost function:
$$\mathrm{CTC}(x) = -\log P(y^* \mid x)$$
CTC provides the sequence of phonemes, predicting a phoneme only when it is very sure of its presence and outputting the ‘blank’ symbol for the rest of the frames. Traditional approaches depend heavily on the frame alignment and give poor results even with slight misalignment, whereas CTC overcomes this obstacle and thus provides a more robust way of training an RNN for sequence transcription tasks without requiring any prior alignment information.
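The following toy Python sketch (illustrative only, and deliberately brute-force) mirrors the description above: the operator B collapses an alignment by merging repeats and dropping blanks, and P(y|x) is obtained by summing over every alignment that collapses to y. A practical implementation would use the forward-backward recursion instead of enumerating alignments.

```python
# A toy Python sketch of the CTC ideas above (not an efficient implementation):
# the collapsing operator B removes repeated labels and then blanks, and
# P(y|x) is obtained by summing frame-wise probabilities over all alignments
# that collapse to y.
import itertools
import numpy as np

BLANK = 0  # index of the blank symbol

def collapse(alignment):
    """Operator B: merge repeats, then drop blanks."""
    merged = [a for i, a in enumerate(alignment) if i == 0 or a != alignment[i - 1]]
    return tuple(a for a in merged if a != BLANK)

def ctc_prob(y, probs):
    """P(y|x) for per-frame label posteriors probs of shape (T, K)."""
    T, K = probs.shape
    total = 0.0
    for alignment in itertools.product(range(K), repeat=T):  # all K^T alignments
        if collapse(alignment) == tuple(y):
            total += np.prod([probs[t, alignment[t]] for t in range(T)])
    return total

# Tiny example: 4 frames, labels {blank=0, 'c'=1, 'a'=2, 't'=3}.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=4)       # each row sums to 1
print(ctc_prob([1, 2, 3], probs))               # probability of the sequence 'c a t'
print(-np.log(ctc_prob([1, 2, 3], probs)))      # the corresponding CTC loss
```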
3.2 Sequence to sequence attention mechanism
In sequence-to-sequence (seq2seq) learning, an RNN is trained to map an input sequence to an
output sequence which is not necessarily of the same length. It has been used in sequence
transduction tasks like neural machine translation, image captioning and automated question
answering. As the nature of speech and the transcriptions is sequential, seq2seq is a viable
proposition for speech recognition as well. In this paradigm, the seq2seq model (Figure 3.3)
attempts to learn to map the sequential variable-length input and output sequences with an en-
coder RNN that maps the variable-length input into a fixed length context vector representation.
This vector representation is used by a decoder to output the sequences which can vary in length.
An attention mechanism (Figure 3.4) can be used to produce a sequence of vectors from
the encoder RNN for each time-step of the input sequence. These indicate which adjacent time frames the system should attend to at any particular instant in order to make a decision. The decoder learns to pay selective attention to these vectors to produce the output at each time-step. In contrast to the simple seq2seq model, the attention vector is used to pass information through the network at every input step. Originally inspired by human visual attention, the attention mechanism is a dominant technique in computer vision for visual recognition tasks, where the neural network focuses on a specific region of the image in ‘high resolution’ while perceiving the rest of the image in ‘low resolution’. It has been successfully applied to object recognition and image captioning tasks.
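A minimal NumPy sketch of this attention step is given below (dot-product scoring is assumed here purely for illustration; it is not necessarily the scoring function used elsewhere in this thesis): the decoder state is scored against every encoder output, the scores are normalised into attention weights, and the context vector is the weighted sum of encoder outputs.

```python
# A minimal NumPy sketch (illustrative, not the thesis code) of the attention
# step in a seq2seq model.
import numpy as np

rng = np.random.default_rng(0)

def attention_context(decoder_state, encoder_outputs):
    """Dot-product attention: returns (context, weights)."""
    scores = encoder_outputs @ decoder_state          # (T,)
    scores = scores - scores.max()
    weights = np.exp(scores) / np.exp(scores).sum()   # attention weights over time
    context = weights @ encoder_outputs               # (D,) weighted sum
    return context, weights

T, D = 50, 128                                        # encoded frames, state size (assumed)
encoder_outputs = rng.normal(size=(T, D))             # h = Listen(x)
decoder_state = rng.normal(size=D)                    # s_i
context, weights = attention_context(decoder_state, encoder_outputs)
print(context.shape, weights.sum())                   # (128,) 1.0
```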
Attention mechanism is different from CTC in the sense that it does not assume the outputs
to be conditionally independent of each other. As a result, it can learn a language model on its own, in addition to the acoustic and pronunciation models, jointly in a single network, aided by the availability of large amounts of speech data.

Figure 3.4: A seq2seq model with attention
3.3 Listen, attend and spell

3.3.1 Overview

The E2E LAS architecture takes acoustic features as inputs, and the output of the network is a sequence of characters/phonemes depending on the way it is trained. Let $x = (x_1, \dots, x_T)$ be the input sequence of filterbank spectral features, and let $y = (\langle\mathrm{sos}\rangle, y_1, \dots, y_S, \langle\mathrm{eos}\rangle)$ be the output sequence of characters/phonemes.
Figure 3.5: Architecture of LAS network. It consists of three modules namely: listener (en-
coder), attender (alignment generator), and speller (decoder).
Each character output $y_i$ is modeled as a conditional distribution over the previous characters $y_{<i}$ and the input sequence $x$ using the chain rule:
$$P(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})$$
The LAS architecture is composed of two sub-modules: the listener and the speller. The listener
is the acoustic encoder that transforms the original input signal x into a higher level represen-
tation h. The AttendAndSpell function takes h as input and produces a probability distribution
over character/phoneme sequences:
$$h = \mathrm{Listen}(x)$$
$$P(y \mid x) = \mathrm{AttendAndSpell}(h, y)$$
3.3.2 Encoder (listener)
The listener network (Figure 3.5) employs a bidirectional long short-term memory (BLSTM) encoder with a pyramidal structure to reduce the length of the encoded sequence. The pyramidal structure is needed because an utterance can contain thousands of frames, which leads to slow convergence. The pyramidal BLSTM reduces the number of time-steps by a factor of two for each layer added, which lowers the computational complexity and helps the attention mechanism converge faster.
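The time reduction itself can be sketched in a few lines of NumPy (shapes are assumed for illustration): consecutive outputs of a layer are concatenated in pairs before being passed to the next BLSTM layer, halving the number of time-steps at every level of the pyramid.

```python
# A short sketch (assumed shapes, not the thesis code) of the pyramidal step
# used in the listener: frame pairs are concatenated, halving the time axis.
import numpy as np

def pyramid_step(h):
    """h: (T, D) layer outputs -> (T // 2, 2 * D) by concatenating frame pairs."""
    T, D = h.shape
    T = T - (T % 2)                      # drop an odd trailing frame
    return h[:T].reshape(T // 2, 2 * D)

h = np.random.default_rng(0).normal(size=(1000, 256))   # 1000 frames in, 256-dim
for _ in range(3):                                       # three pyramidal layers
    h = pyramid_step(h)                                  # a BLSTM would follow each step
print(h.shape)                                           # (125, 2048): 8x fewer time-steps
```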
3.3.3 Decoder (speller)

The AttendAndSpell network (Figure 3.5) uses an attention-based LSTM transducer. At each output time step, this network produces a distribution over the next character, conditioned on all the characters emitted before.
Let the decoder state at time step $i$ be $s_i$, which is a function of the previous state $s_{i-1}$, the previous output $y_{i-1}$ and the previous context $c_{i-1}$. The context vector $c_i$ is the output of the attention mechanism:
$$c_i = \mathrm{ContextForAttention}(s_i, h)$$
$$s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1})$$
3.3.4 Learning
The Listen(.) and AttendAndSpell(.) functions are trained jointly to maximize the following
log probability:
$$\max_{\theta} \sum_{i} \log P(y_i \mid x, y^*_{<i}; \theta)$$
3.3.5 Decoding
After training the LAS model, we can extract the most likely character sequence given the input acoustic features:
$$\hat{y} = \underset{y}{\arg\max}\; \log P(y \mid x)$$
For decoding, a left-to-right beam search is used, as is common in machine translation tasks.
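A toy beam-search sketch is shown below (illustrative only; the thesis relies on the decoder provided by the Nabu toolkit, and the scoring function here is a random stand-in for the speller's distribution): at each step every live hypothesis is extended with all candidate labels, and only the best-scoring beam_width hypotheses are kept.

```python
# A toy left-to-right beam search sketch (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS = 6, 0                              # toy vocabulary, index 0 = <eos>

def next_log_probs(prefix):
    """Stand-in for log P(y_i | x, y_<i): a random but valid log-distribution."""
    logits = rng.normal(size=VOCAB)
    return logits - np.logaddexp.reduce(logits)

def beam_search(beam_width=4, max_len=10):
    beams = [((), 0.0)]                        # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:   # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            logp = next_log_probs(prefix)
            for tok in range(VOCAB):
                candidates.append((prefix + (tok,), score + logp[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search())                           # best hypothesis and its log-probability
```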
3.4 Implementation
TensorFlow [24] is a deep learning toolbox from Google which aids the implementation of deep learning models. In particular, we use an open-source ASR toolkit, Nabu [25], which uses TensorFlow as its backend. The toolkit works in several stages: data preparation, training, and finally testing and decoding. It supports the essential building blocks, namely CNN + BLSTM or pyramidal BLSTM based encoders, dot-product/location-aware attention, LSTM/RNN decoders, and Kaldi-style (a popular conventional ASR modelling toolkit) data processing, feature extraction/formats, and recipes. This helps us design architectures of our choice and easily experiment with various hyperparameters of our models. Working with deep learning models requires large computational resources. For implementing and training our architectures, we used an NVIDIA Quadro GPU with 2 GB of memory aided by 128 GB of RAM. Deep learning libraries tend to throw compatibility issues because of version mismatches; in particular, TensorFlow version 1.8.0 has been used for performing the experiments.
Chapter 4
In this chapter, we explore E2E approaches for tackling two important problems in the code-
switching domain, i.e. automatic speech recognition and language identification. We first
present the conventional unified character set approach and highlight the problems with the
same. Next, our proposed target set reduction scheme is presented in detail. This is followed by
a detailed discussion on the proposed joint intra-sentential language identification scheme.
Hindi characters: अ आ इ ई उ ऊ ऋ ए ऐ ओ औ क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल व श ष स ह क़ ख़ ग़ ज़ ड़ ढ़ फ़ ा ि ी ु ू ृ े ै ो ौ ॉ ् ँ ं ः

English characters: a b c d e f g h i j k l m n o p q r s t u v w x y z

Common phone set: a aa i ii u uu rq ee ei o ou k kh g gh ng c ch j jh nj tx txh dx dxh nx t th d dh n p ph b bh m y r l w sh sx s h kq khq gq z jhq dxq dxhq f q hq mq ao ae au ai er ng oy

Figure 4.1: The top two rows show the default Hindi and English character sets, respectively. The proposed reduced target labels covering both Hindi and English sets are shown in the bottom row.
In the unified character set approach, the character sets of the underlying languages are simply combined. This expands the target set for the E2E ASR task even though the amount of data available for training remains the same. Thus, comparatively less data is eventually available to learn a reliable representation for each of the characters in the target set. As the amount of data available for training cannot be increased due to availability constraints, we need to devise intelligent models which can perform well even in the absence of a huge amount of training data. Inspired by the acoustic similarity of the languages involved in code-switching data, and the fact that E2E ASR systems basically try to learn the probability of emission of a target label given acoustic data, we propose a novel target set reduction scheme for code-switching ASR which is presented in the next section.
Towards addressing those constraints, we explore a target set reduction scheme by exploiting
the acoustic similarity in the underlying languages of the code-switching task. This scheme
is primarily intended to enhance the performances of code-switching E2E ASR systems. The
validation of the proposed idea has been done on Hindi-English code-switching task using both
E2E network and hybrid deep neural network based hidden Markov model (DNN-HMM).
Figure 4.2: Examples of the proposed common phone-level labelling and the existing character-level labelling schemes for E2E ASR system training. Note that for the given words, the unique targets number 22 when tokenized at the character level and 12 when tokenized using the proposed scheme.

| Word | Character-level tokens | Common phone-level tokens |
| bharat | b h a r a t | bh aa r a t |
| भारत | भ ◌ा र त | bh aa r a t |
| english | e n g l i s h | i ng l ii sh |
| अंग्रेज़ी | अ ◌ं ग ◌् र ◌े ज ◌ी | a ng r ei j ii |
Despite the expansion of the target set in the case of code-switching E2E ASR, the phone sets
corresponding to the underlying languages may have significant acoustic similarity. This fact is
well known and has been exploited in the creation of a common phone set across languages [26].
Motivated by that, we propose a scheme for target set reduction in code-switching E2E ASR task
by creating common target labels based on acoustic similarity. In the following, the proposed
scheme has been explained in detail in the context of Hindi-English code-switching ASR task
which has been used in this work for experimentation. In principle, it can be extended to any
other code-switching context as well.
The Hindi and English languages comprise 68 and 26 characters, respectively. For reference, those are shown in the top two rows of Figure 4.1. In [26], a composite phone set covering major Indian languages is proposed in the context of computer processing. Along similar lines, a phone set for English has been defined. Combining the labels for Hindi and English, a common phone set comprising 62 elements is derived; it is shown in the bottom row of Figure 4.1. Using this common phone set, a dictionary keeping the default pronunciations for
all Hindi and English words in the HingCoS corpus (described in detail in Chapter 5) is created.
Now, the targets for acoustic modelling are derived by taking the pronunciation breakup of all
Hindi and English words. A few example words along with their default character-level and the
proposed common phone-level tokenizations are shown in Figure 4.2. It can be observed that
the considered Hindi/English words lead to 22 unique targets when tokenized at the character
level and 12 unique targets when tokenized using the proposed scheme. For the Hindi-English
code-switching task, the proposed approach results in a 34% reduction in the size of the target
set. The importance of this reduction is enhanced by the fact that the availability of code-switching data is still limited. With a smaller target set, the labels can now be learnt more reliably even from a smaller amount of data. This has a significant impact on the performance of the corresponding ASR systems. The experimental results discussed later in Chapter 6 support this argument.
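The mapping itself is straightforward, as the hedged sketch below illustrates; the tiny lexicon holds only the example entries of Figure 4.2 and stands in for the full HingCoS pronunciation dictionary.

```python
# A hedged sketch of the proposed target set reduction: instead of emitting
# language-specific characters, each word is mapped to common phone-level
# labels through a pronunciation dictionary. The entries below are only the
# illustrative examples from Figure 4.2, not the full HingCoS lexicon.
lexicon = {
    "bharat":   ["bh", "aa", "r", "a", "t"],
    "भारत":     ["bh", "aa", "r", "a", "t"],
    "english":  ["i", "ng", "l", "ii", "sh"],
    "अंग्रेज़ी":   ["a", "ng", "r", "ei", "j", "ii"],
}

def char_targets(words):
    """Unified character set tokenization: one target per character."""
    return [c for w in words for c in w]

def phone_targets(words, lexicon):
    """Proposed scheme: one target per common phone, via the pronunciation lexicon."""
    return [p for w in words for p in lexicon[w]]

words = list(lexicon)
print(len(set(char_targets(words))))            # number of unique character targets
print(len(set(phone_targets(words, lexicon))))  # 12 unique phone targets
```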
Figure 4.3: Creation of character-level LID tags for the training data towards condition-
ing the E2E networks to perform LID task on code-switching speech. The ‘H/E’ denotes
Hindi/English LID tag. The ‘b/e’ label is appended to the ‘H/E’ LID tag to mark the be-
gin/end characters.
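A small Python sketch of this tagging scheme is given below (an assumption-laden illustration rather than the actual data-preparation script; in particular, deciding a word's language from its Unicode block is our simplification).

```python
# A hedged sketch of the character-level LID tagging described in Figure 4.3:
# every character of a word receives the LID tag of its language ('H' or 'E'),
# with 'b'/'e' appended to the tags of the word's first and last characters.
# The language test below (Devanagari block => Hindi) is an assumption made
# for illustration.
def word_language(word):
    return "H" if any("\u0900" <= ch <= "\u097F" for ch in word) else "E"

def lid_tags(sentence):
    """Map a code-switched sentence to a character-level LID tag sequence."""
    tags = []
    for word in sentence.split():
        lang = word_language(word)
        for i, _ in enumerate(word):
            if i == 0:
                tags.append(lang + "b")          # begin character
            elif i == len(word) - 1:
                tags.append(lang + "e")          # end character
            else:
                tags.append(lang)
    return tags

print(lid_tags("मेरा current account balance"))
# ['Hb', 'H', 'H', 'He', 'Eb', 'E', ..., 'Ee', ...]
```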
Chapter 5
Experimental setup
5.1 Database
In this work, we explore the working of E2E ASR systems for two distinct tasks: monolingual ASR and code-switching ASR. The databases used for each of these tasks are described below.
5.1.1 Monolingual database - TIMIT

The monolingual experiments in this work are performed on a small standard speech corpus, TIMIT. The TIMIT dataset includes phonetic and lexical transcriptions of speech from American English speakers of both genders, hailing from eight different dialect regions. The dataset consists of a total of 6300 utterances spoken by 630 speakers, with 10 sentences per speaker. The label set contains a total of 48 phonemes including silence. The dataset is divided into train, development and test sets as proposed by the creators of the dataset, with each speaker contributing 8 utterances.
Figure 5.1: Sample code-switching sentences in the HingCoS corpus and their corresponding English translations.

Hindi-English code-switching sentences:
- क्या आप मुझे Deccan queen का departure time बता सकते हैं
- Functions के नाम भी lowercase letter से ही शुरू होते हैं
- कृपया मुझे मेरा current account balance बताएँ

English translations:
- Can you tell me the departure time of Deccan queen.
- The names of the functions also start with a lowercase letter.
- Please tell me my current account balance.
5.1.2 Code-switching database - HingCoS

The code-switching experiments in this work are performed using a Hindi-English code-switching speech corpus, referred to as the HingCoS corpus. A detailed description of the same is available in [29]. A few example sentences along with their English translations are shown in Figure 5.1.
It consists of 101 speakers, each of whom was asked to speak 100 unique code-switching sentences. The length of those sentences varies from 3 to 60 words.
All speech data is recorded at 8 kHz sampling rate and 16-bits/sample resolution. The database
contains 9251 Hindi-English code-switching utterances which correspond to about 25 hours of
speech data. For ASR system modelling, the database is partitioned into train, development
and test sets containing 7015, 1152 and 2136 sentences, respectively. To study the effect of
utterance-length in decoding, three partitions of test set are created on the basis of length of
utterances. Those partitions correspond to utterance-length ranges as 3-15, 16-25, and 26-60
words and are referred to as Test1, Test2, and Test3, respectively. The resulting Test1, Test2, and Test3 sets consist of 957, 719, and 460 utterances, respectively. The unified character set modelling case comprises 95 targets (26 English characters, 68 Hindi characters, and a word separator). In contrast, the proposed scheme reduces that to 63 targets (62 common phones and
a word separator). In this work, we contrast the performances of the proposed reduced target
set based E2E ASR systems with those of unified character set based ones.
5.2 Evaluation metric

For the ASR task, the developed systems are evaluated in terms of the phone (or character) error rate, computed as
$$\mathrm{PER} = \frac{S_p + D_p + I_p}{N_p} \times 100$$
where $S_p$, $D_p$, and $I_p$ denote the numbers of substitutions, deletions, and insertions, respectively, and $N_p$ denotes the total number of labels in the reference.
For LID task, the developed E2E systems are evaluated in terms of the LID error rate computed
as
$$\text{LID error rate} = \frac{N_S + N_I + N_D}{N} \times 100$$
where, the numerator terms NS , NI , and ND refer to the number of substitutions, insertions, and
deletions, respectively. The denominator N refers to the total number of labels in the reference.
For this evaluation, the reference transcriptions for all test utterances labeled in terms of the
proposed LID tags are aligned with the corresponding outputs produced by the E2E network.
In addition to this character-level LID error rate, a corresponding word-level LID error rate is
also computed in a similar fashion by applying majority voting scheme [30] on the character-
level LID labels.
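The following sketch (not the evaluation scripts used in the thesis) illustrates both steps: an edit-distance based error rate over label sequences, and the majority vote that maps the character-level LID tags of one word to a single word-level label.

```python
# An illustrative sketch of the two measures above: an edit-distance based
# error rate over label sequences, and the majority-voting step that maps the
# character-level LID labels of a word to a single word-level LID label.
from collections import Counter

def error_rate(reference, hypothesis):
    """(S + D + I) / N * 100 via dynamic-programming edit distance."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[n][m] / len(reference)

def word_level_lid(char_tags):
    """Majority vote over the character-level tags ('H...'/'E...') of one word."""
    votes = Counter(tag[0] for tag in char_tags)
    return votes.most_common(1)[0][0]

print(error_rate("Hb H He Eb Ee".split(), "Hb E He Eb Ee".split()))  # 20.0
print(word_level_lid(["Hb", "E", "He"]))                             # 'H'
```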
5.3 System description

5.3.1 Monolingual experiments

The raw audio samples are preprocessed to extract filterbank energies as features. The hyperparameters used for training the E2E LAS architecture on the TIMIT database are summarized in Table 5.1. Experiments on the well-known TIMIT database have been performed to benchmark our implementation of the E2E ASR systems.
5.3.2 Code-switching experiments

The E2E models developed in this work are trained using the Nabu toolkit [25], which is based on TensorFlow.
Table 5.1: Hyperparameters used for training the E2E LAS architecture on the TIMIT database.

Encoder: 2 + 1 layers, 128 units per layer in each direction, dropout 0.5
Decoder: 2 layers, 128 units per layer, beam width 16, dropout 0.5
Training: initial learning rate 0.2, exponential decay 0.1, batch size 32
For contrast, the DNN-HMM systems have also been trained and evaluated using the Kaldi toolkit [31]. The parameter settings used for analyzing the speech data include a window length of 25 ms, a window shift of 10 ms, and a pre-emphasis factor of 0.97. The 26-dimensional features comprising log filter-bank energies are used for developing the E2E systems.
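A hedged sketch of this front-end is shown below, using the python_speech_features package as a stand-in for the Nabu/Kaldi feature pipelines actually used in the experiments.

```python
# A hedged sketch of the front-end described above (an assumption: the thesis
# relies on the Nabu/Kaldi pipelines rather than this exact code). It computes
# 26 log filter-bank energies with a 25 ms window, 10 ms shift and 0.97
# pre-emphasis.
import numpy as np
from python_speech_features import logfbank

sample_rate = 8000                         # HingCoS audio is sampled at 8 kHz
signal = np.random.default_rng(0).normal(size=sample_rate * 2)  # stand-in for 2 s of speech

features = logfbank(signal,
                    samplerate=sample_rate,
                    winlen=0.025,          # 25 ms window
                    winstep=0.01,          # 10 ms shift
                    nfilt=26,              # 26 log filter-bank energies
                    preemph=0.97)          # pre-emphasis factor
print(features.shape)                      # (number_of_frames, 26)
```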
systems. It is to be noted that, the E2E systems are optimized for the reduced target set and the
same parameters have been used for the unified character set systems. The remaining details of
the above mentioned systems are presented next.
Attention-based E2E model

The architectural details of the LAS model are as follows. The encoder has 3 pyramidal
DBLSTM layers with 512 units in each layer. The pyramidal step size is kept as 2 and the
dropout rate in training is set to 0.5. The LSTM decoder consists of 2 layers with 512 units in
each layer. The dropout rate for the LSTM decoder is also set to 0.5. Loss function used for
training is the average cross-entropy loss and Gaussian noise with σ = 0.6 is added to the data
while training. We have employed the beam-search decoder with beam width set as 16. The
model is trained for 400 epochs. Batch size used is 32 with learning rate decay fixed at 0.1.
CTC-based E2E model
This modelling paradigm involves a DBLSTM network as the encoder which consists of 4
layers and 256 units in each layer with dropout rate set to 0.5. The decoder utilizes CTC loss
function as discussed in Section 3.1. Gaussian noise with σ = 0.6 is added to the speech data for
modelling robustness. In model training, the number of epochs is set as 250 and the mini-batch
size is set to 32.
DNN-HMM model
The DNN-HMM acoustic model contains 5 hidden layers and 1024 nodes in each layer. The
hidden nodes use tanh as the non-linearity. First, 13-dimensional MFCC features are spliced
across ±3 frames to produce 91-dimensional feature vectors, which are then projected to 40
dimensions by applying linear discriminant analysis. These 40-dimensional feature vectors are
used for training the DNN-HMM acoustic model. The model is run for 20 epochs with a batch
size of 128.
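The feature preparation can be sketched as follows (shapes only; the projection matrix below is a random stand-in for the LDA transform that Kaldi estimates from aligned training data).

```python
# A small sketch (assumed shapes, not the Kaldi recipe itself) of the feature
# preparation described above: 13-dimensional MFCC frames are spliced across
# +/-3 neighbouring frames to give 91-dimensional vectors, then projected to
# 40 dimensions.
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(500, 13))          # 500 frames of 13-dim MFCCs

def splice(feats, context=3):
    """Concatenate each frame with its +/-context neighbours (edges are padded)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * context + 1)])

spliced = splice(mfcc)                     # (500, 91)
lda = rng.normal(size=(91, 40))            # stand-in for the learned LDA matrix
projected = spliced @ lda                  # (500, 40) inputs to the DNN-HMM model
print(spliced.shape, projected.shape)
```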
5.3.3 Experimental setup for LID

In this subsection, we describe the tuning of the parameters of both the developed LID systems, which is done on the development set defined earlier. In this work, the attention-based E2E LID system is trained by employing the LAS network, in which the encoder (listener) has 2 hidden layers, each with 128 BLSTM nodes. The dropout rate of the encoder is set to 0.5. The number of hidden layers and nodes of the decoder (speller) are kept the same as those of the encoder, except
that the nodes are simple LSTMs. The LAS network is trained by setting the number of epochs
as 300, batch size as 32, and the learning rate decay as 0.1. During decoding, the beam width
is set as 8. For contrast purpose, a CTC-based E2E LID model is trained with 3 hidden layers,
each having 128 BLSTM nodes. The dropout rate of the encoder is set to be similar to that of
the LAS model. The network is trained with number of epochs as 300. Also, the parameters
corresponding to the batch size, and the learning rate decay are set as 8 and 0.1, respectively.
During decoding, the CTC cost function is employed to produce 1-best output sequence.
Chapter 6
In this chapter, we first discuss the results for the E2E ASR task (monolingual ASR, code-switching ASR, and the results of the proposed target set reduction scheme), followed by the results obtained for the proposed joint LID task.
6.1 Results on TIMIT

Table 6.1: Comparison of previously reported phone error rates with our implementation on the core test set of the TIMIT database.
6.2 Results on HingCoS
In the ASR task, for the unified target set case, the performances are measured in terms of the character error rate (CER), whereas for the reduced target set case we have used the phone error rate (PER) as the measure. For proper evaluation, both attention and CTC based E2E ASR systems are developed using the reduced and unified target sets, and their performances are reported in Table 6.2. It can be observed that with the proposed reduction in the target set, all the explored E2E systems yield significantly improved recognition performance (i.e., target error rate) over the corresponding unified target set based systems. Interestingly, this trend carries over all three test sets defined earlier. On comparing the reduced target set systems, we note that the attention-based E2E ASR system outperforms the CTC-based one, whereas the CTC-based E2E system has yielded slightly better CER for the unified target set modelling case.
| System | Test1 (PER / CER) | Test2 (PER / CER) | Test3 (PER / CER) | Average (PER / CER) |
| Attention-based E2E | 21.01 / 33.69 | 21.06 / 34.80 | 23.70 / 39.38 | 21.92 / 35.96 |
| CTC-based E2E | 32.91 / 35.82 | 28.89 / 32.85 | 28.33 / 33.87 | 30.04 / 34.18 |

Table 6.2: Evaluation of attention and CTC based E2E systems developed using both reduced and unified target sets on Hindi-English code-switching data. The performances of the reduced and unified target set based systems are measured using the phone error rate (PER) and character error rate (CER), respectively. The performances of the DNN-HMM system on those tasks are also given for reference purposes.
It is worth emphasizing that with further reduction in the target set, further improvement in PERs could be achieved relative to CERs. But any such reduction would be counterproductive if we cannot derive accurate word sequences from the output hypotheses expressed in terms of those reduced target labels. That criterion is very much satisfied by the proposed phone-based reduction of the target set in the case of code-switching speech. On the other hand, for unified target set based E2E ASR systems, the decoded outputs may comprise cross-language character insertions due to acoustic similarity. To illustrate this, we show a few example decoded sequences for both reduced and unified target set based E2E systems in Table 6.3. From that table, we can note that the decoded sequence for the attention-based E2E system exhibits better recognition of the target labels as well as better word boundary marking in comparison to that of the CTC-based system. This trend is attributed to the ability of the attention-based E2E network to utilize all the previously decoded labels along with the current input while making decisions.

Target set: Reduced
Oracle sequence: a g a r _ aa p k o _ y a h _ p o s tx _ p a s a q d _ aa y aa _ h o _ t o _ aa p _ i s ee _ l ai k _ j a r uu r _ k a r ee
Table 6.3: Sample decoded outputs for E2E code-switching ASR systems developed using
reduced and unified target sets. The errors have been highlighted in bold. Note that, the symbol
‘ ’ is used to mark separation between the words.
Oracle m ee r ee _ b l ao g _ k aa _ tx ao p i k _ k y aa _ h o n aa _ c aa h i ee
CTC based m ee r ee _ b l ao g _ p a _ tx ao p i k _ k y aa _ c aa h i ee
Oracle म ◌े र ◌े _ b l o g _ क ◌ा _ t o p i c _ क ◌् य ◌ा _ ह ◌ो न ◌ा _ च ◌ा ह ि◌ ए
CTC based म ◌े र ◌े _ b l o g _ क ◌ा _ t o f i c _ क ◌् य ◌ा _ ह ◌ो न ◌ा _ च ◌ा ह ि◌ ए
Oracle i s k ee _ k o n _ s ee _ ei s ee _ f ii c er s _ h ei
CTC based i s k ee _ k au n _ s ee _ ei s ee _ c u _ h ei
Oracle इ स क ◌े _ क ◌ौ न _ स ◌े _ ऐ स ◌े _ f e a t u r e s _ ह ◌ै
CTC based इ स क ◌े _ क o म _ स ◌े _ ऐ स ◌े _ f e a t u r e _ ह ◌ै
Oracle y a h aa mq _ p a r _ aa p k o _ tx uu _ ao p sh a n _ m i l ee g aa
CTC based y a h aa mq _ p a r _ aa p k o _ tx o _ ao p n _ m i l ee g aa
Oracle य ह ◌ा ◌ँ _ प र _ आ प क ◌ो _ t w o _ o p t i o n _ म ि◌ ल ◌े ग ◌ा
CTC based य ह ◌ा ◌ँ _ प र _ आ प क ◌ो _ t w o _ o p ◌् t o n _ म ि◌ ल ◌े ग ◌ा
Table 6.4: Sample decoded outputs for E2E code-switching ASR systems developed using
reduced and unified target set for comparison across models.
Table 6.5: Evaluation of the developed E2E LID systems on the Hindi-English code-switching task. The LID error rates have been computed at both the character and word levels. The total number of characters/words (N) in the reference transcription is 198,855 / 41,025.
In this work, two different kinds of E2E joint LID systems are developed and evaluated on the
HingCoS corpus. The LID error rates computed both at character and word levels for these
systems are reported in Table 6.5. In contrast to CTC, the use of the LAS architecture in the E2E LID system is noted to yield a substantial reduction in the error rates. This is attributed to the ability of the attention mechanism in the LAS network to accurately predict the language switching in the data. To highlight that, we have computed the language-specific averaged attention weights
with respect to the decoded LID label sequence and the plot for the same is shown in Figure 6.1.
The description of each of the subplots in Figure 6.1 is presented next.
Figure 6.1(a) shows the spectrogram of a typical Hindi-English speech utterance in the
test set. Note that the spectrogram is manually labeled with the spoken words and their boundaries for reference purposes. The variations of the averaged attention weights for Hindi and
English language targets present in the input speech data with respect to time, are shown in
Figure 6.1(b). The sequence alignment produced by the attention network for the input speech
data (on the x-axis) and the decoded output LID labels (on the y-axis) is plotted in Figure 6.1(c).
From Figures 6.1(b) and 6.1(c), we observe that the attention weights for Hindi and English
languages mostly peak around the corresponding word locations.
It is worth highlighting here that both CTC-based and attention-based E2E systems are
provided with identical target-level supervision during training. Unlike the attention-based system, the CTC-based system could not fully exploit that supervision. This is attributed to the fact that CTC assumes the outputs at different time steps to be conditionally independent, which makes it less capable of learning the label sequence structure.
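To make the conditional-independence argument concrete, the standard formulations of the two objectives can be contrasted as follows, with $\mathbf{x}$ the input feature sequence of length $T$, $\mathbf{y} = (y_1, \ldots, y_U)$ the output label sequence, $\pi$ a frame-level CTC alignment path, and $\mathcal{B}$ the CTC collapsing function:
\[
P_{\mathrm{CTC}}(\mathbf{y} \mid \mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} P(\pi_t \mid \mathbf{x}),
\qquad
P_{\mathrm{att}}(\mathbf{y} \mid \mathbf{x}) = \prod_{u=1}^{U} P(y_u \mid y_1, \ldots, y_{u-1}, \mathbf{x}).
\]
Only the attention-based factorization conditions each output label on the previously emitted labels, which is precisely the label-level context that CTC, by construction, does not model.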
Labeled word sequence of the utterance shown in Figure 6.1(a): sil | इस | case | में | cash | को | select | कीजिए | sil
Figure 6.1: Visualization of the attention mechanism for the LID task. For a given Hindi-English code-switching utterance: (a) spectrogram labeled with Hindi and English word boundaries for reference, (b) variation of the attention weights with respect to time for the Hindi and English languages, and (c) alignment produced by the attention network between the input speech and the decoded output LID labels.
Figure 6.2: Visualization of the attention mechanism for the LID task on a second example utterance; the panels (a)–(c) follow the same layout as in Figure 6.1.
To support this argument, for the very utterance considered in Figure 6.1, the character-level decoded outputs of the CTC- and attention-based E2E LID systems are listed in Table 6.6. The word-level LID labels for both the considered systems are also shown in that table. On comparing the hypothesized sequences of output labels, it can be noted that the inclusion of the attention mechanism in the E2E LID system leads to more effective language identification within code-switching speech data.
Character-level LID labels
Reference sequence: Hb He | Eb E Ee | Hb H He | Eb E Ee | Hb He | Eb E E E E Ee | Hb H H H He
CTC-based hypothesis: Hb E Ee | Eb E Ee | Eb He | Hb He | Eb Ee | Eb Ee | Hb He
Attention-based hypothesis: Hb He | Eb E E E Ee | Hb H He | Eb E E Ee | Hb He | Eb E E E E Ee | Hb H H H He
Table 6.6: The character and word level decoded outputs for CTC- and attention-based E2E
LID systems for the utterance considered in Figure 6.1. A majority voting scheme is employed
for mapping the character-level LID label sequences to word-level LID label sequences. The
attention-based system is able to decode the LID label sequences more accurately when com-
pared to the CTC-based system.
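The majority-voting step mentioned in the caption can be sketched as follows. The input format, with ‘|’ separating words and labels such as ‘Hb’, ‘H’, ‘He’, ‘Eb’, ‘E’, ‘Ee’, follows our reading of Table 6.6, and the helper name is illustrative rather than taken from the actual code.

```python
from collections import Counter

# Minimal sketch of the majority-voting mapping from character-level LID labels
# to word-level LID labels. The input format ('|' separating words; labels such
# as 'Hb', 'H', 'He', 'Eb', 'E', 'Ee') follows our reading of Table 6.6; ties
# are broken here by first occurrence, which the actual system may handle
# differently.

def char_to_word_lid(char_label_string):
    word_labels = []
    for word in char_label_string.split("|"):
        langs = [lab[0] for lab in word.split()]   # 'Hb' -> 'H', 'Ee' -> 'E'
        if langs:
            word_labels.append(Counter(langs).most_common(1)[0][0])
    return word_labels

# CTC-based hypothesis from Table 6.6
hyp = "Hb E Ee | Eb E Ee | Eb He | Hb He | Eb Ee | Eb Ee | Hb He"
print(char_to_word_lid(hyp))   # one word-level LID label ('H' or 'E') per word
```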
Chapter 7
Conclusion and future work
In this work, we have built E2E ASR systems, tried to understand their working, and compared their performance with that of conventional DNN-HMM systems. Further, to extend the prowess of these E2E systems to tasks like code-switching ASR, where a large amount of training data is not available, we present a novel target label reduction scheme for training the E2E code-switching ASR systems. The systems employing the reduced targets are shown to outperform the unified-target-based systems. It has been demonstrated that the attention-based E2E system trained with the reduced target set achieves the best averaged target (phone) error rate. Further, since code-switching data comprises two or more languages, to build better code-switching ASR systems we also propose joint E2E LID systems employing the CTC and attention mechanisms for identifying the languages present in code-switching speech. The development and evaluation of the proposed systems are done on the Hindi-English code-switching speech corpus. Towards developing the LID systems, a novel target labeling scheme has been introduced, which is found to be very effective for the attention-based system. On comparing the attention and CTC mechanisms, the former is noted to achieve a two-fold reduction in both the character- and word-level LID error rates. The work also demonstrates the ability of the attention mechanism to detect the language boundaries in code-switching speech data. Although the experiments have been performed on Hindi-English code-switching data, the proposed approach can easily be extended to other code-switching contexts.
Our aim now is to build a joint LID-ASR system for E2E ASR of code-switching speech under the paradigm of multitask learning, as described in Section 1.5. This is motivated by a recent work [36], in which the authors reported improvement in Mandarin-English code-switching ASR by employing multitask learning with the LID labels. Further, we aim to explore the attention mechanism by employing a supervised attention framework (with supervision provided by the LID task) for building more effective attention-based E2E code-switching ASR systems. Work in this direction has already started, and we aim to complete it within the next few months.
Chapter 8
List of publications
1. Kunal Dhawan, Ganji Sreeram, Kumar Priyadarshi and Rohit Sinha, “Investigating
Target Set Reduction for End-to-End Speech Recognition of Hindi-English Code-Switching
Data”, submitted to Interspeech, 2019.
2. Ganji Sreeram, Kunal Dhawan, Kumar Priyadarshi and Rohit Sinha, “Joint Language
Identification of Code-Switching Speech using Attention based E2E Network”, submitted
to Interspeech, 2019.
Other publications
1. Sreeram Ganji, Kunal Dhawan and Rohit Sinha, “IITG-HingCoS Corpus: A Hinglish
Code-Switching Database for Automatic Speech Recognition”, Speech Communication, 2019. DOI: https://doi.org/10.1016/j.specom.2019.04.007
Bibliography
[4] A. Graves, “Sequence transduction with recurrent neural networks,” in Proc. of Interna-
tional Conference on Machine Learning: Representation Learning Workshop, 2012.
[5] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural
networks,” in Proc. of International Conference on Machine Learning, 2014, pp. 1764–
1772.
[6] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recog-
nition using attention-based recurrent NN: First results,” in Proc. of Deep Learning and
Representation Learning Workshop, 2014.
[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” in Proc. of International Conference on Learning Representations,
2015.
[9] T. N. Sainath et al., “No need for a lexicon? Evaluating the value of the pronunciation lexica in end-to-end models,” CoRR, vol. abs/1712.01864, 2017.
[14] K. Bhuvanagirir and S. K. Kopparapu, “Mixed language speech recognition without ex-
plicit identification of language,” American Journal of Signal Processing, vol. 2, no. 5, pp.
92–97, 2012.
[15] B. H. Ahmed and T.-P. Tan, “Automatic speech recognition of code switching speech using
1-best rescoring,” in Proc. of International Conference on Asian Language Processing
(IALP), 2012, pp. 137–140.
[17] S. Malhotra, “Hindi-English, code switching and language choice in urban, uppermiddle-
class Indian families,” University of Kansas. Linguistics Graduate Student Association,
1980.
[20] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neu-
ral networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, May 2013, pp. 6645–6649.
[21] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for
large vocabulary conversational speech recognition,” in 2016 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing, 2016, pp. 4960–4964.
[22] W. Geng et al., “End-to-end language identification using attention-based recurrent neural
networks,” in Proc. of Interspeech, 2016.
[26] B. Ramani et al., “A common attribute based unified HTS framework for speech synthesis
in Indian languages,” in Proc. of 8th ISCA Workshop on Speech Synthesis, 2013.
[27] K. A. H. Zirker, “Intrasentential vs. intersentential code switching in early and late bilin-
guals,” 2007.
[28] W. Chan et al., “Listen, attend and spell: A neural network for large vocabulary conversa-
tional speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech
and Signal Processing, pp. 4960–4964.
[30] B. Parhami, “Voting algorithms,” IEEE Transactions on Reliability, vol. 43, no. 4, pp.
617–629, 1994.
[31] D. Povey et al., “The Kaldi speech recognition toolkit,” IEEE Signal Processing Society,
2011.
[32] F. Sha and L. K. Saul, “Large margin Gaussian mixture modeling for phonetic classification and recognition,” in 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2006, pp. I–I.
[33] A. Graves, N. Jaitly, and A. Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Dec 2013, pp. 273–278.
[34] O. Abdel-Hamid, L. Deng, and D. Yu, “Exploring convolutional neural network structures and optimization techniques for speech recognition,” in Proc. of Interspeech, 2013.
[35] J. Ming and F. J. Smith, “Improved phone recognition using Bayesian triphone models,” in 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, 1998, pp. I–409.
[36] N. Luo, D. Jiang, S. Zhao, C. Gong, W. Zou, and X. Li, “Towards end-to-end code-
switching speech recognition,” arXiv preprint arXiv:1810.13091, 2018.