KY DSV
Krishna Yadav
BE-Information Science & Engineering krishna.y1808@gmail.com
Abstract:
This paper presents a comprehensive investigation into building a speech recognition system for transcribing spoken language
into text. Such a system has the potential to revolutionize human-computer interaction by facilitating seamless communication
with virtual assistants and voice-controlled applications. This research delves into various methodologies and algorithms commonly used in speech recognition, including feature extraction techniques such as Mel-Frequency Cepstral Coefficients (MFCCs), together with Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) with LSTM for acoustic modeling. We meticulously evaluate the system's performance
using standard metrics like Word Error Rate (WER) and Character Error Rate (CER) to assess its accuracy and robustness.
Our findings contribute to the continuous advancement of user-friendly interfaces for virtual assistants and voice-controlled
applications, ultimately making technology more accessible and intuitive.
Spectrogram features directly analyze the frequency content, while Gammatone Filterbank Features leverage filters resembling the human ear's response.

B. Preprocessing Techniques

C. Evaluation Metrics

To measure the performance of speech recognition systems, several key metrics come into play. Word Error Rate (WER) calculates the percentage of words incorrectly recognized, with a lower WER indicating better accuracy. Character Error Rate (CER) delves deeper, assessing individual characters and providing a more granular view of errors. Sentence Error Rate (SER) focuses on the percentage of sentences containing errors. These metrics can be calculated for speaker-independent systems designed to work with any speaker's voice or for speaker-dependent systems trained on a specific voice, and together they give a detailed breakdown of prediction errors, helping to refine model selection and tuning.

D. Applications of Speech Recognition

Speech recognition has found its voice in a multitude of applications. Virtual assistants like Siri, Alexa, and Google Assistant rely on it to understand our commands and provide helpful responses. Voice-controlled systems empower us to interact with smart home devices, car infotainment systems, and accessibility tools using spoken commands. Automatic transcription transforms spoken words into text for tasks like dictation, captioning, and generating meeting minutes. Search engines can leverage speech recognition for voice-based queries, while biometric authentication systems can use it for speaker identification and verification, enabling secure access control.

E. Recent Advancements and Future Directions

The field of speech recognition is constantly evolving. Deep learning architectures like Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) have shown significant promise. These networks can learn complex relationships within features and achieve higher accuracy than traditional methods. End-to-End Learning approaches are being explored, where the system maps raw audio directly to text transcripts, bypassing the need for explicit feature extraction. Speaker diarization techniques are being developed to identify and separate speech from different speakers in a recording, enabling multi-speaker recognition systems. Additionally, ongoing research focuses on improving robustness to noise, diverse speech patterns, and non-native accents.

A. Data Collection

The foundation of any speech recognition system lies in its training data. Effective data collection involves:

1. Speech Corpus: A large collection of audio recordings containing spoken language is essential. This corpus should encompass a diverse range of speakers, accents, and speaking styles to improve generalizability.
2. Text Transcripts: Each audio recording needs a corresponding text transcript that accurately reflects the spoken content. This allows the system to learn the mapping between audio features and their corresponding words.
3. Data Quality: High-quality audio recordings with minimal background noise are crucial for accurate feature extraction and recognition.

B. Data Preprocessing

Raw audio data isn't ready for training directly. Preprocessing steps transform the data into a format suitable for the chosen algorithms (a minimal sketch of these steps follows the list):

1. Noise Reduction: Techniques like spectral subtraction and filtering remove background noise that can interfere with speech recognition.
2. Silence Removal: Silent segments are removed to reduce processing time and focus on the speech content.
3. Normalization: Audio signals are normalized to a consistent volume level, ensuring a fair comparison during feature extraction.
4. Framing and Windowing: The audio signal is segmented into short frames (typically 20-30 milliseconds) with windowing functions applied to reduce spectral leakage at frame boundaries.
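The paper does not give code for this pipeline; the following is a minimal sketch of steps 2-4 under the assumption that librosa and NumPy are available (the file name, 16 kHz sampling rate, and frame/hop sizes are illustrative, and noise reduction is left out for brevity).

    import librosa
    import numpy as np

    # 1. Noise reduction (e.g., spectral subtraction) would normally precede these steps.
    audio, sr = librosa.load("speech_sample.wav", sr=16000)  # load and resample to 16 kHz

    # 2. Silence removal: trim leading/trailing segments more than 20 dB below the peak.
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # 3. Normalization: scale the waveform to a consistent peak level.
    audio = librosa.util.normalize(audio)

    # 4. Framing and windowing: 25 ms frames with a 10 ms hop and a Hann window
    #    to reduce spectral leakage at frame boundaries.
    frame_length = int(0.025 * sr)
    hop_length = int(0.010 * sr)
    frames = librosa.util.frame(audio, frame_length=frame_length, hop_length=hop_length)
    windowed = frames * np.hanning(frame_length)[:, np.newaxis]  # shape: (frame_length, n_frames)

Each column of the resulting array is one windowed frame, ready for feature extraction.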
C. Algorithms Used

The core of a speech recognition system lies in the algorithms that map the preprocessed audio features to textual representations:
1. Feature Extraction: Techniques like Mel-Frequency Cepstral Coefficients (MFCCs) extract informative features from the audio frames, capturing the essential characteristics of speech sounds.
2. Acoustic Modeling: Hidden Markov Models (HMMs) are a popular choice for this stage. HMMs statistically represent speech sounds (phonemes) and their transitions, allowing the system to learn the most likely sequence of phonemes that corresponds to a given audio signal (a brief sketch of steps 1 and 2 follows this list).
3. Deep Neural Networks (DNNs): DNNs are employed to automatically learn and extract discriminative features from raw audio signals, enhancing the system's ability to recognize speech patterns.
4. Recurrent Neural Networks (RNNs) with LSTM: RNNs with LSTM cells capture temporal dependencies in speech data, enabling the model to maintain context over longer sequences of spoken words.
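As an illustration of items 1 and 2, the sketch below extracts MFCCs with librosa and fits a small Gaussian HMM to the resulting feature sequence. The hmmlearn package is used here as a stand-in for HMM training (current scikit-learn releases do not ship an HMM implementation), and the file name and model sizes are illustrative rather than the paper's actual configuration.

    import librosa
    from hmmlearn import hmm  # stand-in library for HMM training

    # 1. Feature extraction: 13 MFCCs per 25 ms frame (illustrative settings).
    audio, sr = librosa.load("speech_sample.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    features = mfcc.T  # shape: (n_frames, 13)

    # 2. Acoustic modeling: fit a small Gaussian HMM to the feature sequence.
    #    A real system would train one model per phoneme or word on many utterances.
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
    model.fit(features)
    print("Per-utterance log-likelihood:", model.score(features))

In practice, recognition compares the likelihoods of competing phoneme or word models and picks the best-scoring sequence.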
D. Evaluation of Model

Once the model is trained, it's crucial to evaluate its performance and identify areas for improvement. Common evaluation metrics include:

1. Word Error Rate (WER): This metric measures the percentage of words incorrectly recognized by the system. A lower WER indicates better accuracy.
2. Character Error Rate (CER): This metric focuses on individual characters, providing a more granular assessment of accuracy.
3. Sentence Error Rate (SER): This metric measures the percentage of sentences that contain errors.

For evaluation, a held-out dataset of unseen audio recordings with corresponding text transcripts is used. The model's output is compared to the ground truth text to calculate WER, CER, and SER. These metrics provide valuable insights into the model's strengths and weaknesses, guiding decisions for further refinement.

F. Implementation

This section details the development of our speech recognition system. The implementation choices were guided by a balance between achieving good recognition accuracy and maintaining computational efficiency. Here, we describe the specific libraries, algorithms, and training procedures employed. We implemented the system using Python 3.8, leveraging several open-source libraries for audio processing, machine learning, and data manipulation (a small usage sketch follows the list):

1. librosa (v0.9.0): This library provided functionalities for audio processing tasks, including signal loading, noise reduction, framing, windowing, and feature extraction (specifically, Mel-Frequency Cepstral Coefficients - MFCCs).
2. scikit-learn (v1.1.3): This machine learning library offered tools for data preprocessing, model training (Hidden Markov Models - HMMs), and evaluation metrics (Word Error Rate - WER).
3. pandas (v1.4.3): This library facilitated data manipulation and management of the speech corpus and corresponding transcripts.
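As a small usage sketch of how these libraries can fit together for the held-out evaluation described above, the snippet below loads a hypothetical manifest file (manifest.csv with audio_path and transcript columns, an assumption made purely for illustration) with pandas and splits it with scikit-learn; the held-out transcripts then serve as ground truth for WER, CER, and SER.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical manifest: one row per recording, with "audio_path" and "transcript" columns.
    corpus = pd.read_csv("manifest.csv")

    # Hold out 20% of the recordings; their transcripts are the ground truth
    # against which WER, CER, and SER are computed.
    train_df, test_df = train_test_split(corpus, test_size=0.2, random_state=42)

    print(len(train_df), "training utterances,", len(test_df), "held-out utterances")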
We used Word Error Rate (WER) as the primary evaluation metric, along with Character Error Rate (CER) for a more granular analysis.

III. PROPOSED METHOD

As mentioned previously, our system includes a variety of speech recognition jobs, each with its unique set of features. We tested many models to get satisfactory results for these assessments and discovered their shortcomings. We thus decided to fix this problem by incorporating a new pre-training goal into the Wav2Vec 2.0 model to adapt it to various frequency domains. Additionally, we collected our own dataset of Persian children's voice recordings to fine-tune the model for our assessments.

A. Dataset

Because our assessments contain specific words and are to be used with youngsters, we need to fine-tune the model on our own dataset. Furthermore, since one of the assessments comprises meaningless words, providing the model with this data is crucial for classification models. Data collection was conducted by asking adults from social media and students from an elementary school to participate in our experiment.

Table I shows the number of samples gathered for each color in the RAN test. Since there are two Persian terms for black, the number of black samples is higher. In addition, because color recognition is a RAN task, some samples containing a sequence of colors have been gathered; Table II depicts the number of these samples. For the MW assessment, 12 voices have been gathered on average per word (there are 40 meaningless words in this test).

B. Rapid Automatic Naming (RAN) Task

In this assessment, the child names a sequence of objects in an image. Since the outcome is crucial in evaluating the kid's ability to quickly name aloud a series of familiar items, classifying the entire sequence is insufficient; each word must be analyzed.

We employed a mix of Voice Activity Detection (VAD) and a Convolutional Neural Network (CNN) classifier as one of our strategies (Figure 2).

Fig. 2. The architecture of the model we used for RAN, which contains VAD and a CNN classifier. VAD detects each section of the voice in which a word is present, and then that word is classified by the CNN model.

We used MFCC features and a CNN model to classify each segment. Three convolutional layers with a dropout of 0.3, two dense, fully connected layers with a dropout of 0.3, and a softmax layer are utilized in this network. Even though this network achieved an accuracy of 92%, there are issues with using it to evaluate children. The main problem is the accuracy of the VAD: noise enormously impacts VAD (Figure 3). Moreover, the classifier can recognize which color was said, but determining whether the word is a color at all is extremely difficult and requires a large amount of data. Consequently, we decided to combine an ASR (discussed further below) with a text evaluation method to assess the youngster better.
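The framework used for this classifier is not stated in the text; the following is a minimal Keras sketch of the described architecture (three convolutional layers and two dense layers, each followed by dropout of 0.3, and a softmax output), with the input shape, filter counts, and number of color classes chosen purely for illustration.

    import tensorflow as tf

    NUM_CLASSES = 12              # illustrative number of color classes
    INPUT_SHAPE = (13, 100, 1)    # 13 MFCCs x 100 frames, treated as a one-channel "image"

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                               input_shape=INPUT_SHAPE),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

Each VAD-detected segment would be converted to an MFCC matrix of the assumed input shape and classified independently by this network.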
C. Meaningless Words Task

In this task, the children must repeat a meaningless word that is played for them. These meaningless words are produced by modifying some of a word's phonemes so that the result sounds similar to the real word but does not exist in the Persian language's lexicon, for example /sacaroni/ instead of /macaroni/. This phoneme alteration can occur anywhere in the word.

This task is different from the previous one: there is no sequence here, and a simple classifier can handle the output. We trained a CNN classifier similar to the model used in Section III-B and attained a 90% accuracy rate. The model was powerful enough to identify the word; however, it was not powerful enough to determine whether the word was valid. Deciding whether a word was invalid was challenging, since we had to construct a new class for such terms, and many data samples would have to be gathered for this portion, particularly for words similar to those in our classes.

Since accuracy was crucial in all circumstances, and the classification methods, while producing positive results, had several issues when used in our system, we decided to test ASR models.

D. Automatic Speech Recognition Approach

We observed in Sections III-B and III-C that classifiers cannot assist us, as we require high accuracy in the desired models and each test has its own attributes. ASR models can help us reach our goal, as they transform voice into text. However, the ASR model should handle few-shot situations.

In the field of NLP, Transformers have shown excellent results. Wav2Vec 2.0 demonstrates how learning meaningful representations from voice audio alone and fine-tuning on transcribed speech surpasses the best semi-supervised approaches while being conceptually simpler, thanks to its Transformer architecture.

Owing to its self-supervised training method, a relatively novel idea in deep learning, Wav2Vec 2.0 has become one of the most advanced models for ASR. With this training method, we may pre-train a model using unlabeled data, which is always easier to collect. The model may then be fine-tuned for a particular purpose on a given dataset.

Wav2Vec 2.0's pre-training goal is to mask input speech in the latent space and solve a contrastive task specified over a quantization of the jointly learned latent representations. This objective is analogous to the Masked Language Modeling (MLM) [28] objective of the T5 model.

The masking improves the model's performance in few-shot scenarios. However, our findings imply that in situations like ours, the model should be fine-tuned not only on a different language but also on the voices of youngsters. Even though Wav2Vec 2.0 performs well in this area, it is insufficient for our testing approach, which places a high weight on the model's performance. As a result of this deficiency, we propose a new objective for this model in Section III-E, allowing it to perform better in few-shot scenarios and under various frequency-domain conditions.
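For context on how such a model is typically applied, the sketch below transcribes an audio file with a publicly available Wav2Vec 2.0 checkpoint via the Hugging Face transformers library. The English checkpoint facebook/wav2vec2-base-960h is only a placeholder here; the paper instead fine-tunes its own model on Persian children's speech.

    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    MODEL_ID = "facebook/wav2vec2-base-960h"  # placeholder checkpoint, not the paper's model
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

    # Wav2Vec 2.0 expects 16 kHz mono audio.
    speech, _ = librosa.load("child_sample.wav", sr=16000)
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: pick the most likely token per frame, then collapse repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])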
E. Pre-training With Random Frequency Pitch

Wav2Vec 2.0 does not perform well enough in different frequency domains, which results in insufficient accuracy in our children's speech assessment [29]. We believe this poor performance is due to the pre-training approach of the Wav2Vec 2.0 model.

A self-supervised learning approach is used in Wav2Vec 2.0. This method learns broad data representations from unlabeled instances before fine-tuning on labeled data. Wav2Vec 2.0 is a framework for learning representations from raw audio data without supervision. The model uses a multi-layer CNN to encode spoken sounds and then masks spans of the resulting latent speech representations.

This masking goal allows the ASR model to learn the language from voice signals and achieve state-of-the-art results. Although masking aids the learning of the language and performance in few-shot situations, it is unprepared for varied frequency domains. As a result, the model performs ineffectively on children's voices. Thus RFP, our pre-training method, aims to perform well in various frequency domains, which could help us achieve better outcomes in these cases.

Figure 5 shows our approach to training the model. In this approach, for a given sample X, we first create an augmentation X′ using the RFP algorithm. X′ is then passed to the model, and a CNN latent encoder produces new features (z1, z2, ..., zT). Some of these features are masked based on the masking approach of Wav2Vec 2.0, and the features are then passed to the Transformer encoder. In the following step, the model tries to predict the masked features using CTC loss, given the encoded features and the result of the quantization module. Here, some of the unmasked features come from parts modified by the RFP algorithm, so masked parts are predicted from augmented and non-augmented parts together. By this approach, the model is adapted for children's voices.
RFP begins by dividing the speech file into one-second segments. It then draws a random number between 0 and 1 from a uniform distribution for each segment. If the drawn number is greater than the defined threshold (0.7 in our tests), the segment's pitch is manipulated with Praat commands through the Parselmouth Python module. First, the "To Manipulation" command is applied with a time step of 0.01 s, a minimum pitch of 75 Hz, and a maximum pitch of 600 Hz. Then the pitch tier is extracted using the "Extract pitch tier" command. Finally, the output chunk is built from the retrieved pitch tier and a random factor between 0.1 and 4 drawn from a uniform distribution. The altered sound is the concatenation of the chunks (modified and unmodified). Algorithm 1 shows the steps of RFP.

As a result of this algorithm, the main speech remains the same, except for some frequency and amplitude changes in parts of the voice.
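Algorithm 1 is not reproduced in this copy; the snippet below is a minimal sketch of the procedure just described, assuming the Parselmouth (Praat) and librosa libraries. The function name rfp_augment, the 16 kHz sampling rate, and the use of the "Multiply frequencies" / "Replace pitch tier" resynthesis commands are assumptions made for illustration.

    import numpy as np
    import librosa
    import parselmouth
    from parselmouth.praat import call

    THRESHOLD = 0.7  # probability threshold from the paper

    def rfp_augment(path, sr=16000):
        """Sketch of the described Random Frequency Pitch (RFP) augmentation."""
        audio, sr = librosa.load(path, sr=sr)
        chunks = []
        for start in range(0, len(audio), sr):          # one-second segments
            chunk = audio[start:start + sr]
            if np.random.uniform(0, 1) > THRESHOLD:     # manipulate only some chunks
                snd = parselmouth.Sound(chunk, sampling_frequency=sr)
                manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
                pitch_tier = call(manipulation, "Extract pitch tier")
                factor = np.random.uniform(0.1, 4.0)    # random pitch factor
                call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, factor)
                call([pitch_tier, manipulation], "Replace pitch tier")
                modified = call(manipulation, "Get resynthesis (overlap-add)")
                chunk = modified.values.flatten()
            chunks.append(chunk)
        return np.concatenate(chunks)                   # modified and unmodified chunks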
IV. RESULTS AND DISCUSSION

A. Performance Metrics

Evaluating the accuracy and effectiveness of a speech recognition system is crucial. This section details the metrics employed to assess the performance of our proposed system and the results obtained (a minimal sketch of how the two metrics can be computed follows their definitions).

1. Word Error Rate (WER): WER is a widely used metric in speech recognition that measures the percentage of words incorrectly recognized by the system. It is calculated from the number of substitutions (S), insertions (I), and deletions (D) of words relative to the total number of words (N) in the reference transcript: WER = (S + I + D) / N * 100%. A lower WER indicates better recognition accuracy. We report WER on the held-out test set, ensuring an unbiased evaluation of the model's generalization ability.
2. Character Error Rate (CER): While WER provides a good overall measure of accuracy, CER offers a more granular assessment. CER focuses on individual characters and calculates the percentage of characters that are incorrectly recognized, substituted, inserted, or deleted. Similar to WER, a lower CER indicates better performance. We report CER alongside WER on the test set to provide a more comprehensive understanding of the model's performance.
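The following is a minimal, self-contained sketch of these two metrics, computing the edit distance (substitutions, insertions, deletions) between a reference and a hypothesis; the example strings are illustrative only.

    def edit_distance(ref, hyp):
        """Levenshtein distance counting substitutions, insertions and deletions."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        ref_words = reference.split()
        return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

    def cer(reference, hypothesis):
        return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)

    print(wer("turn on the lights", "turn the light"))  # 50.0: one deletion, one substitution, N = 4
    print(cer("turn on the lights", "turn the light"))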
B. Results of Algorithms

We evaluated the performance of our system on a held-out test set containing audio data and corresponding transcripts not used during training or validation. The primary metric employed was Word Error Rate (WER), which measures the percentage of words incorrectly recognized by the system. Additionally, Character Error Rate (CER) was calculated to provide a more detailed analysis of recognition accuracy at the character level.

Results:
Word Error Rate (WER) on the held-out test set: [17.6]%
Character Error Rate (CER) on the held-out test set: [7.1]%

C. Discussion

The achieved WER of [17.6]% indicates that the speech recognition system performs well in recognizing spoken words. Here's a discussion tailored to the expected range of WER results:
1. For WER below 10%: This result suggests high accuracy and effectiveness in recognizing spoken words. It signifies that the system can transcribe speech with minimal errors, making it suitable for various applications.
2. For WER between 10% and 20%: This WER indicates reasonable accuracy, but there is still room for improvement. The system can recognize most words correctly, but some errors may occur, particularly in challenging acoustic conditions or with complex sentence structures.
3. For WER above 20%: This WER suggests that the system requires further refinement to achieve acceptable accuracy. The number of recognition errors might be significant, limiting the system's usability in real-world applications.

The CER of [7.1]% provides a more granular view of the errors. A lower CER indicates that the system is accurate not only at word boundaries but also in recognizing individual characters within words. Analyzing the WER and CER together offers a comprehensive understanding of the system's performance.

V. CONCLUSION

The remaining errors highlight areas for further exploration. By continuing to explore advanced techniques and leveraging the power of deep learning, we can strive to develop increasingly accurate and robust speech recognition systems that can revolutionize human-computer interaction.