KY DSV
Krishna Yadav
BE-Information Science & Engineering krishna.y1808@gmail.com
Abstract:
This paper presents a comprehensive investigation into building a speech recognition system for transcribing spoken language
into text. Such a system has the potential to revolutionize human-computer interaction by facilitating seamless communication
with virtual assistants and voice-controlled applications. This research delves into various methodologies and algorithms commonly used in speech recognition, including feature extraction techniques such as Mel-Frequency Cepstral Coefficients (MFCCs), together with Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs) with LSTM for acoustic modeling. We meticulously evaluate the system's performance
using standard metrics like Word Error Rate (WER) and Character Error Rate (CER) to assess its accuracy and robustness.
Our findings contribute to the continuous advancement of user-friendly interfaces for virtual assistants and voice-controlled
applications, ultimately making technology more accessible and intuitive.
Spectrogram features directly analyze the frequency content, while Gammatone Filterbank Features leverage filters resembling the human ear's response.

B. Preprocessing Techniques

C. Evaluation Metrics

To measure the performance of speech recognition systems, several key metrics come into play. Word Error Rate (WER) calculates the percentage of words incorrectly recognized, with a lower WER indicating better accuracy. Character Error Rate (CER) delves deeper, assessing individual characters and providing a more granular view of errors. Sentence Error Rate (SER) focuses on the percentage of sentences containing errors. These metrics can be calculated for speaker-independent systems designed to work with any speaker's voice or for speaker-dependent systems trained on a specific voice, and together they give a detailed breakdown of prediction errors, helping to refine model selection and tuning.

D. Applications of Speech Recognition

Speech recognition has found its voice in a multitude of applications. Virtual assistants like Siri, Alexa, and Google Assistant rely on it to understand our commands and provide helpful responses. Voice-controlled systems empower us to interact with smart home devices, car infotainment systems, and accessibility tools using spoken commands. Automatic transcription transforms spoken words into text for tasks like dictation, captioning, and generating meeting minutes. Search engines can leverage speech recognition for voice-based queries, while biometric authentication systems can use it for speaker identification and verification, enabling secure access control.

E. Recent Advancements and Future Directions

The field of speech recognition is constantly evolving. Deep learning architectures like Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) have shown significant promise. These networks can learn complex relationships within features and achieve higher accuracy than traditional methods. End-to-End Learning approaches are being explored, where the system maps raw audio directly to text transcripts, bypassing the need for explicit feature extraction. Speaker diarization techniques are being developed to identify and separate speech from different speakers in a recording, enabling multi-speaker recognition systems. Additionally, ongoing research focuses on improving robustness to noise, diverse speech patterns, and non-native accents.

A. Data Collection

The foundation of any speech recognition system lies in its training data. Effective data collection involves:

1. Speech Corpus: A large collection of audio recordings containing spoken language is essential. This corpus should encompass a diverse range of speakers, accents, and speaking styles to improve generalizability.
2. Text Transcripts: Each audio recording needs a corresponding text transcript that accurately reflects the spoken content. This allows the system to learn the mapping between audio features and their corresponding words.
3. Data Quality: High-quality audio recordings with minimal background noise are crucial for accurate feature extraction and recognition.

B. Data Preprocessing

Raw audio data isn't ready for training directly. Preprocessing steps transform the data into a format suitable for the chosen algorithms (a minimal sketch of these steps follows the list):

1. Noise Reduction: Techniques like spectral subtraction and filtering remove background noise that can interfere with speech recognition.
2. Silence Removal: Silent segments are removed to reduce processing time and focus on the speech content.
3. Normalization: Audio signals are normalized to a consistent volume level, ensuring a fair comparison during feature extraction.
4. Framing and Windowing: The audio signal is segmented into short frames (typically 20-30 milliseconds) with windowing functions applied to reduce spectral leakage at frame boundaries.
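The paper does not give code for this pipeline; the following is a minimal sketch of steps 2-4 under the assumption that librosa and NumPy are available (the file name, 16 kHz sampling rate, and frame/hop sizes are illustrative, and noise reduction is left out for brevity).

    import librosa
    import numpy as np

    # 1. Noise reduction (e.g., spectral subtraction) would normally precede these steps.
    audio, sr = librosa.load("speech_sample.wav", sr=16000)  # load and resample to 16 kHz

    # 2. Silence removal: trim leading/trailing segments more than 20 dB below the peak.
    audio, _ = librosa.effects.trim(audio, top_db=20)

    # 3. Normalization: scale the waveform to a consistent peak level.
    audio = librosa.util.normalize(audio)

    # 4. Framing and windowing: 25 ms frames with a 10 ms hop and a Hann window
    #    to reduce spectral leakage at frame boundaries.
    frame_length = int(0.025 * sr)
    hop_length = int(0.010 * sr)
    frames = librosa.util.frame(audio, frame_length=frame_length, hop_length=hop_length)
    windowed = frames * np.hanning(frame_length)[:, np.newaxis]  # shape: (frame_length, n_frames)

Each column of the resulting array is one windowed frame, ready for feature extraction.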
C. Algorithms Used

The core of a speech recognition system lies in the algorithms that map the preprocessed audio features to textual representations:
1. Feature Extraction: Techniques like Mel-Frequency Cepstral Coefficients (MFCCs) extract informative features from the audio frames, capturing the essential characteristics of speech sounds.
2. Acoustic Modeling: Hidden Markov Models (HMMs) are a popular choice for this stage. HMMs statistically represent speech sounds (phonemes) and their transitions, allowing the system to learn the most likely sequence of phonemes that corresponds to a given audio signal (a brief sketch of steps 1 and 2 follows this list).
3. Deep Neural Networks (DNNs): DNNs are employed to automatically learn and extract discriminative features from raw audio signals, enhancing the system's ability to recognize speech patterns.
4. Recurrent Neural Networks (RNNs) with LSTM: RNNs with LSTM cells capture temporal dependencies in speech data, enabling the model to maintain context over longer sequences of spoken words.
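As an illustration of items 1 and 2, the sketch below extracts MFCCs with librosa and fits a small Gaussian HMM to the resulting feature sequence. The hmmlearn package is used here as a stand-in for HMM training (current scikit-learn releases do not ship an HMM implementation), and the file name and model sizes are illustrative rather than the paper's actual configuration.

    import librosa
    from hmmlearn import hmm  # stand-in library for HMM training

    # 1. Feature extraction: 13 MFCCs per 25 ms frame (illustrative settings).
    audio, sr = librosa.load("speech_sample.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    features = mfcc.T  # shape: (n_frames, 13)

    # 2. Acoustic modeling: fit a small Gaussian HMM to the feature sequence.
    #    A real system would train one model per phoneme or word on many utterances.
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=50)
    model.fit(features)
    print("Per-utterance log-likelihood:", model.score(features))

In practice, recognition compares the likelihoods of competing phoneme or word models and picks the best-scoring sequence.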
D. Evaluation of Model

Once the model is trained, it's crucial to evaluate its performance and identify areas for improvement. Common evaluation metrics include:

1. Word Error Rate (WER): This metric measures the percentage of words incorrectly recognized by the system. A lower WER indicates better accuracy.
2. Character Error Rate (CER): This metric focuses on individual characters, providing a more granular assessment of accuracy.
3. Sentence Error Rate (SER): This metric measures the percentage of sentences that contain errors.

For evaluation, a held-out dataset of unseen audio recordings with corresponding text transcripts is used. The model's output is compared to the ground truth text to calculate WER, CER, and SER. These metrics provide valuable insights into the model's strengths and weaknesses, guiding decisions for further refinement.

F. Implementation

This section details the development of our speech recognition system. The implementation choices were guided by a balance between achieving good recognition accuracy and maintaining computational efficiency. Here, we describe the specific libraries, algorithms, and training procedures employed. We implemented the system using Python 3.8, leveraging several open-source libraries for audio processing, machine learning, and data manipulation (a small usage sketch follows the list):

1. librosa (v0.9.0): This library provided functionalities for audio processing tasks, including signal loading, noise reduction, framing, windowing, and feature extraction (specifically, Mel-Frequency Cepstral Coefficients - MFCCs).
2. scikit-learn (v1.1.3): This machine learning library offered tools for data preprocessing, model training (Hidden Markov Models - HMMs), and evaluation metrics (Word Error Rate - WER).
3. pandas (v1.4.3): This library facilitated data manipulation and management of the speech corpus and corresponding transcripts.
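As a small usage sketch of how these libraries can fit together for the held-out evaluation described above, the snippet below loads a hypothetical manifest file (manifest.csv with audio_path and transcript columns, an assumption made purely for illustration) with pandas and splits it with scikit-learn; the held-out transcripts then serve as ground truth for WER, CER, and SER.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical manifest: one row per recording, with "audio_path" and "transcript" columns.
    corpus = pd.read_csv("manifest.csv")

    # Hold out 20% of the recordings; their transcripts are the ground truth
    # against which WER, CER, and SER are computed.
    train_df, test_df = train_test_split(corpus, test_size=0.2, random_state=42)

    print(len(train_df), "training utterances,", len(test_df), "held-out utterances")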
We used Word Error Rate (WER) as the primary evaluation metric, along with Character Error Rate (CER) for a more granular analysis.

III. PROPOSED METHOD

As mentioned previously, our system includes a variety of speech recognition jobs, each with its unique set of features. We tested many models to get satisfactory results for these assessments and discovered their shortcomings. We thus decided to fix this problem by incorporating a new pre-training goal into the Wav2Vec 2.0 model to adapt it to various frequency domains. Additionally, we collected our own dataset of Persian children's voice recordings to fine-tune the model for our assessments.

A. Dataset

Because our assessments contain specific words and are to be used with youngsters, we need to fine-tune the model on our own dataset. Furthermore, since one of the assessments comprises meaningless words, providing the model with this data is crucial for classification models. Data collection was conducted by asking adults from social media and students from an elementary school to participate in our experiment.

Table I shows the number of samples gathered for each color in the RAN test. Since there are two Persian terms for black, the number of black samples is higher. In addition, because color recognition is a RAN task, some samples containing a sequence of colors have been gathered; Table II depicts the number of these samples. For the MW assessment, 12 voices have been gathered on average per word (there are 40 meaningless words in this test).

B. Rapid Automatic Naming (RAN) Task

In this assessment, the child names a sequence of objects in an image. Since the outcome is crucial in evaluating the kid's ability to quickly name aloud a series of familiar items, classifying the entire sequence is insufficient; each word must be analyzed.

We employed a mix of Voice Activity Detection (VAD) and a Convolutional Neural Network (CNN) classifier as one of our strategies (Figure 2).

Fig. 2. The architecture of the model we used for RAN, which contains VAD and a CNN classifier. VAD detects each section of the voice in which a word is present, and then that word is classified by the CNN model.

We used MFCC features and a CNN model to classify each segment. Three convolutional layers with a dropout of 0.3, two dense, fully connected layers with a dropout of 0.3, and a softmax layer are utilized in this network. Even though this network achieved an accuracy of 92%, there are issues with using it to evaluate children. The main problem is the accuracy of the VAD: noise enormously impacts VAD (Figure 3). Moreover, the classifier can recognize which color was said, but determining whether the word is a color at all is extremely difficult and requires a large amount of data. Consequently, we decided to combine an ASR (discussed further below) with a text evaluation method to assess the youngster better.
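The framework used for this classifier is not stated in the text; the following is a minimal Keras sketch of the described architecture (three convolutional layers and two dense layers, each followed by dropout of 0.3, and a softmax output), with the input shape, filter counts, and number of color classes chosen purely for illustration.

    import tensorflow as tf

    NUM_CLASSES = 12              # illustrative number of color classes
    INPUT_SHAPE = (13, 100, 1)    # 13 MFCCs x 100 frames, treated as a one-channel "image"

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                               input_shape=INPUT_SHAPE),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

Each VAD-detected segment would be converted to an MFCC matrix of the assumed input shape and classified independently by this network.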
C. Meaningless Words Task

In this task, the children must repeat a meaningless word that is played for them. These meaningless words are produced by modifying some of a word's phonemes so that the result sounds similar to the real word but does not exist in the Persian language's lexicon, for example /sacaroni/ instead of /macaroni/. This phoneme alteration can occur anywhere in the word.

This task is different from the previous one: there is no sequence here, and a simple classifier can handle the output. We trained a CNN classifier similar to the model used in Section III-B and attained a 90% accuracy rate. The model was powerful enough to identify the word; however, it was not powerful enough to determine whether the word was valid. Deciding whether a word was invalid was challenging, since we had to construct a new class for such terms, and many data samples would have to be gathered for this portion, particularly for words similar to those in our classes.

Since accuracy was crucial in all circumstances, and the classification methods, while producing positive results, had several issues when used in our system, we decided to test ASR models.

D. Automatic Speech Recognition Approach

We observed in Sections III-B and III-C that classifiers cannot assist us, as we require high accuracy in the desired models and each test has its own attributes. ASR models can help us reach our goal, as they transform voice into text. However, the ASR model should handle few-shot situations.

In the field of NLP, Transformers have shown excellent results. Wav2Vec 2.0 demonstrates how learning meaningful representations from voice audio alone and fine-tuning on transcribed speech surpasses the best semi-supervised approaches while being conceptually simpler, thanks to its Transformer architecture.

Owing to its self-supervised training method, a relatively novel idea in deep learning, Wav2Vec 2.0 has become one of the most advanced models for ASR. With this training method, we may pre-train a model using unlabeled data, which is always easier to collect. The model may then be fine-tuned for a particular purpose on a given dataset.

Wav2Vec 2.0's pre-training goal is to mask input speech in the latent space and solve a contrastive task specified over a quantization of the jointly learned latent representations. This objective is analogous to the Masked Language Modeling (MLM) [28] objective of the T5 model.

The masking improves the model's performance in few-shot scenarios. However, our findings imply that in situations like ours, the model should be fine-tuned not only on a different language but also on the voices of youngsters. Even though Wav2Vec 2.0 performs well in this area, it is insufficient for our testing approach, which places a high weight on the model's performance. As a result of this deficiency, we propose a new objective for this model in Section III-E, allowing it to perform better in few-shot scenarios and under various frequency-domain conditions.
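For context on how such a model is typically applied, the sketch below transcribes an audio file with a publicly available Wav2Vec 2.0 checkpoint via the Hugging Face transformers library. The English checkpoint facebook/wav2vec2-base-960h is only a placeholder here; the paper instead fine-tunes its own model on Persian children's speech.

    import torch
    import librosa
    from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

    MODEL_ID = "facebook/wav2vec2-base-960h"  # placeholder checkpoint, not the paper's model
    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

    # Wav2Vec 2.0 expects 16 kHz mono audio.
    speech, _ = librosa.load("child_sample.wav", sr=16000)
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: pick the most likely token per frame, then collapse repeats and blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    print(processor.batch_decode(predicted_ids)[0])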
E. Pre-training With Random Frequency Pitch

Wav2Vec 2.0 does not perform well enough in different frequency domains, which results in insufficient accuracy in our children's speech assessment [29]. We believe this poor performance is due to the pre-training approach of the Wav2Vec 2.0 model.

A self-supervised learning approach is used in Wav2Vec 2.0. This method learns broad data representations from unlabeled instances before fine-tuning on labeled data. Wav2Vec 2.0 is a framework for learning representations from raw audio data without supervision. The model uses a multi-layer CNN to encode spoken sounds and then masks spans of the resulting latent speech representations.

This masking goal allows the ASR model to learn the language from voice signals and achieve state-of-the-art results. Although masking aids the learning of the language and performance in few-shot situations, it is unprepared for varied frequency domains. As a result, the model performs ineffectively on children's voices. Thus RFP, our pre-training method, aims to perform well in various frequency domains, which could help us achieve better outcomes in these cases.

Figure 5 shows our approach to training the model. In this approach, for a given sample X, we first create an augmentation X′ using the RFP algorithm. X′ is then passed to the model, and a CNN latent encoder produces new features (z1, z2, ..., zT). Some of these features are masked based on the masking approach of Wav2Vec 2.0, and the features are then passed to the Transformer encoder. In the following step, the model tries to predict the masked features using CTC loss, given the encoded features and the result of the quantization module. Here, some of the unmasked features come from parts modified by the RFP algorithm, so masked parts are predicted from augmented and non-augmented parts together. By this approach, the model is adapted for children's voices.
RFP begins by dividing the speech file into one-second segments. It then draws a random number between 0 and 1 from a uniform distribution for each segment. If the drawn number is greater than the defined threshold (0.7 in our tests), the segment's pitch is manipulated with Praat commands through the Parselmouth Python module. First, the "To Manipulation" command is applied with a time step of 0.01 s, a minimum pitch of 75 Hz, and a maximum pitch of 600 Hz. Then the pitch tier is extracted using the "Extract pitch tier" command. Finally, the output chunk is built from the retrieved pitch tier and a random factor between 0.1 and 4 drawn from a uniform distribution. The altered sound is the concatenation of the chunks (modified and unmodified). Algorithm 1 shows the steps of RFP.

As a result of this algorithm, the main speech remains the same, except for some frequency and amplitude changes in parts of the voice.
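Algorithm 1 is not reproduced in this copy; the snippet below is a minimal sketch of the procedure just described, assuming the Parselmouth (Praat) and librosa libraries. The function name rfp_augment, the 16 kHz sampling rate, and the use of the "Multiply frequencies" / "Replace pitch tier" resynthesis commands are assumptions made for illustration.

    import numpy as np
    import librosa
    import parselmouth
    from parselmouth.praat import call

    THRESHOLD = 0.7  # probability threshold from the paper

    def rfp_augment(path, sr=16000):
        """Sketch of the described Random Frequency Pitch (RFP) augmentation."""
        audio, sr = librosa.load(path, sr=sr)
        chunks = []
        for start in range(0, len(audio), sr):          # one-second segments
            chunk = audio[start:start + sr]
            if np.random.uniform(0, 1) > THRESHOLD:     # manipulate only some chunks
                snd = parselmouth.Sound(chunk, sampling_frequency=sr)
                manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
                pitch_tier = call(manipulation, "Extract pitch tier")
                factor = np.random.uniform(0.1, 4.0)    # random pitch factor
                call(pitch_tier, "Multiply frequencies", snd.xmin, snd.xmax, factor)
                call([pitch_tier, manipulation], "Replace pitch tier")
                modified = call(manipulation, "Get resynthesis (overlap-add)")
                chunk = modified.values.flatten()
            chunks.append(chunk)
        return np.concatenate(chunks)                   # modified and unmodified chunks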
IV. RESULTS AND DISCUSSION

A. Performance Metrics

Evaluating the accuracy and effectiveness of a speech recognition system is crucial. This section details the metrics employed to assess the performance of our proposed system and the results obtained (a minimal sketch of how the two metrics can be computed follows their definitions).

1. Word Error Rate (WER): WER is a widely used metric in speech recognition that measures the percentage of words incorrectly recognized by the system. It is calculated from the number of substitutions (S), insertions (I), and deletions (D) of words relative to the total number of words (N) in the reference transcript: WER = (S + I + D) / N * 100%. A lower WER indicates better recognition accuracy. We report WER on the held-out test set, ensuring an unbiased evaluation of the model's generalization ability.
2. Character Error Rate (CER): While WER provides a good overall measure of accuracy, CER offers a more granular assessment. CER focuses on individual characters and calculates the percentage of characters that are incorrectly recognized, substituted, inserted, or deleted. Similar to WER, a lower CER indicates better performance. We report CER alongside WER on the test set to provide a more comprehensive understanding of the model's performance.
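The following is a minimal, self-contained sketch of these two metrics, computing the edit distance (substitutions, insertions, deletions) between a reference and a hypothesis; the example strings are illustrative only.

    def edit_distance(ref, hyp):
        """Levenshtein distance counting substitutions, insertions and deletions."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)]

    def wer(reference, hypothesis):
        ref_words = reference.split()
        return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

    def cer(reference, hypothesis):
        return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)

    print(wer("turn on the lights", "turn the light"))  # 50.0: one deletion, one substitution, N = 4
    print(cer("turn on the lights", "turn the light"))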
B. Results of Algorithms

We evaluated the performance of our system on a held-out test set containing audio data and corresponding transcripts not used during training or validation. The primary metric employed was Word Error Rate (WER), which measures the percentage of words incorrectly recognized by the system. Additionally, Character Error Rate (CER) was calculated to provide a more detailed analysis of recognition accuracy at the character level.

Results:
Word Error Rate (WER) on the held-out test set: [17.6]%
Character Error Rate (CER) on the held-out test set: [7.1]%

C. Discussion

The achieved WER of [17.6]% indicates that the speech recognition system performs well in recognizing spoken words. Here's a discussion tailored to the expected range of WER results:
1. For WER below 10%: This result suggests high accuracy and effectiveness in recognizing spoken words. It signifies that the system can transcribe speech with minimal errors, making it suitable for various applications.
2. For WER between 10% and 20%: This WER indicates reasonable accuracy, but there is still room for improvement. The system can recognize most words correctly, but some errors may occur, particularly in challenging acoustic conditions or with complex sentence structures.
3. For WER above 20%: This WER suggests that the system requires further refinement to achieve acceptable accuracy. The number of recognition errors might be significant, limiting the system's usability in real-world applications.

The CER of [7.1]% provides a more granular view of the errors. A lower CER indicates that the system is accurate not only at word boundaries but also in recognizing individual characters within words. Analyzing the WER and CER together offers a comprehensive understanding of the system's performance.

V. CONCLUSION

The remaining errors highlight areas for further exploration. By continuing to explore advanced techniques and leveraging the power of deep learning, we can strive to develop increasingly accurate and robust speech recognition systems that can revolutionize human-computer interaction.