
International Journal of Electrical Engineering and Technology (IJEET)

Volume 12, Issue 3, March 2021, pp. 269-277, Article ID: IJEET_12_03_035
Available online at http://iaeme.com/Home/issue/IJEET?Volume=12&Issue=3
ISSN Print: 0976-6545 and ISSN Online: 0976-6553
DOI: 10.34218/IJEET.12.3.2021.035

© IAEME Publication Scopus Indexed

SPEECH RECONSTRUCTION USING LSTM NETWORKS
Lani Rachel Mathew
Department of Electronics & Communication,
Mar Baselios College of Engineering & Technology, (Research Scholar,
LBS Centre for Science & Technology, University of Kerala), Kerala, India

Arun Manohar, Nidheesh S, S. Sainath and Arsha Vijayan


Department of Electronics & Communication,
Mar Baselios College of Engineering & Technology, Kerala, India

K. Gopakumar
Department of Electronics & Communication,
TKM College of Engineering, Kerala, India

ABSTRACT
This paper aims at producing a real-time system that reconstructs partially formed
words of persons with speaking disabilities. Recurrent neural networks (RNN) have
been used extensively in the conversion and reconstruction of partially spoken words.
However, traditional RNNs suffer from exploding and vanishing gradient problems,
which slow learning and degrade the overall performance of the system. To avoid this,
we use LSTM (long short-term memory) networks. The recognized text message is
converted into an audio signal. The MFCC (mel frequency cepstral coefficient)
technique is used for feature extraction. LSTMs offer more control over the output
than plain RNNs, which have only one function controlling the output in a cell. We aim
to create a system that can reconstruct words spoken by people with disorders such as
stuttering. Results indicate that RNN LSTMs offer promising solutions, provided the
model can be trained well.
Keywords: recurrent neural network; mel frequency cepstral coefficient; speech;
assistive; long short term memory
Cite this Article: Lani Rachel Mathew, Arun Manohar, Nidheesh S, S. Sainath, Arsha
Vijayan and K. Gopakumar, Speech reconstruction using LSTM Networks,
International Journal of Electrical Engineering and Technology (IJEET), 12(3), 2021,
pp.269-277.
http://iaeme.com/Home/issue/IJEET?Volume=12&Issue=3


1. INTRODUCTION
Various speech assistive systems have been developed over the past decades to help people
facing difficulties in speaking. Machine learning techniques and deep learning are opening
different avenues in speech recognition [1,2,3], speech enhancement and speech analysis [4,5]
today. Speech recognition relies heavily on principles derived from pattern recognition and
neural network implementations [6,7].
For recognizing speech, the temporal nature of speech signals is better exploited with the
use of RNNs [8,9]. However, traditional RNNs faced the issue of exploding and vanishing
gradient problems during training, and hence various modifications were investigated. One
promising area of research is the use of LSTM networks [10]. RNNs augmented with memory
units are known as LSTM networks; they consist of cells, input gates, output gates and forget
gates. Cells remember values obtained from the network over arbitrary time intervals, and the
three gates regulate the flow of information into and out of the cell. LSTM networks are very
useful for classifying, processing and making predictions based on time series data.
This paper describes the use of an LSTM-based RNN system that converts disordered
speech (mainly speech with stuttering) to normal-sounding speech.
The paper is organized as follows. Section 2 describes the system overview. Section 3
presents an overview of the technology used in processing the speech. Feature extraction is
discussed in Section 4. The implementation details are given in Section 5. Section 6 describes
the results obtained, followed by the conclusions.

2. OVERVIEW OF THE SYSTEM

Figure 1 Schematic of the experimental set-up


The system reconstructs the partially formed words of a person with a speaking disability.
The speech input given to the system is continuously monitored to check whether the sentence
has ended. While the sentence is being spoken, MFCC-based feature extraction is performed
[11]. The feature vectors are passed to the speech recognition block, which consists of an RNN
LSTM network. The database of disordered speech is searched for the best match, and the
corresponding clean speech sample is selected for speech synthesis.
The next step is speech synthesis, the artificial production of human speech. A neural
network is trained for a single person using a database, and it is used to synthesize the audio
output from the text-to-speech (TTS) system. For TTS we use Google text-to-speech (GTTS)
to generate the audio output [12]. GTTS is an interface to Google's text-to-speech API that
helps developers easily integrate speech synthesis into any system. Using the API reduces the
workload on the processor and helps make the system real time.
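
As an illustration, the gTTS library documented in [12] can be driven with just a few lines. This is a minimal sketch; the word and the output file name are placeholders, not values from the paper.

```python
from gtts import gTTS

# Convert a recognized word into an audio file that can be played back to the listener.
tts = gTTS(text="hello", lang="en")   # "hello" stands in for the recognized word
tts.save("hello.mp3")                 # writes an MP3 file with the synthesized speech
```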


3. FEATURE EXTRACTION
The mel frequency cepstral coefficient (MFCC) is the most commonly used technique for feature
extraction in speech recognition applications [11, 13]. Speech is probably the most crucial tool
for communication in our daily lives, so constructing a speech recognition and reconstruction
system is always desirable. Speech recognition is the process of converting an acoustic signal
into a set of words.

Figure 2 Steps involved in MFCC


Framing: The speech signal is normally divided into short-duration blocks, called frames,
and spectral analysis is carried out on these frames. This is because the human speech signal
is slowly time varying and can be treated as a quasi-stationary process. The frame length and
frame shift typically used for speech recognition are 20-30 ms and 10 ms, respectively.
Windowing: After framing, each frame is multiplied by a window function to reduce the
effect of the discontinuities introduced by the framing process, by attenuating the sample values
at the beginning and end of each frame. The Hamming window is commonly used: it decreases
the frequency resolution of the spectral analysis while reducing the sidelobe level of the window
transfer function, and it is the window used here for speech recognition.
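
As a concrete illustration of framing and windowing, the NumPy sketch below splits a signal into overlapping Hamming-windowed frames. The 25 ms frame length and 10 ms shift are assumed values within the range quoted above, not parameters reported by the paper.

```python
import numpy as np

def frame_and_window(signal, sr, frame_ms=25, shift_ms=10):
    """Split a 1-D speech signal into overlapping frames and apply a Hamming window.
    Assumes the signal is at least one frame long."""
    frame_len = int(sr * frame_ms / 1000)     # e.g. 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)         # e.g. 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)            # attenuates sample values at the frame edges
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])   # shape: (n_frames, frame_len)
```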
Mel Filtering: A group of triangular band-pass filters that simulate the characteristics of the
human ear is applied to the spectrum of the speech signal. This process is called mel filtering.
The human ear analyses the sound spectrum in groups based on several overlapping critical
bands. These bands are distributed so that the frequency resolution is high in the low-frequency
region and low in the high-frequency region.
Logarithm: Taking the logarithm of the filterbank outputs approximates the relationship
between the human perception of loudness and the sound intensity.
Discrete Cosine Transform: The cepstral coefficients are obtained by applying the DCT
to the log mel filterbank coefficients. The higher-order coefficients represent the excitation
information, or the periodicity in the waveform, while the lower-order cepstral coefficients
represent the vocal tract shape or smooth spectral shape. MFCC-based extraction thus uses the
DCT to represent the audio input with a smaller number of coefficients.
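
In practice, all of these steps are wrapped inside standard library calls. A minimal sketch using the Librosa library mentioned in Section 5 is given below; the 16 kHz sampling rate and the choice of 13 coefficients are assumptions, since the paper does not state them.

```python
import librosa

# Load one recording; the sampling rate is an assumed value.
y, sr = librosa.load("sample.wav", sr=16000)

# librosa internally performs framing, windowing, mel filtering, the logarithm and the DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

features = mfcc.T   # shape (n_frames, n_mfcc): a time sequence suitable as RNN input
```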

4. NETWORK ARCHITECTURE
A recurrent neural network (RNN) exhibits temporal dynamic behaviour for a time sequence.
An RNN differs from other networks in its ability to store the information it has seen, aided by
an internal memory. This memory helps the network perform tasks such as language translation,
unsegmented connected handwriting recognition and speech recognition. The main problem with
a plain RNN is that it lacks the capability to handle long-term dependencies, so the LSTM is
used instead.
Like all RNNs, an LSTM has repeating cells that perform the same operations. The cell state
carries information through the cell, and it is easy for this information to pass through almost
unchanged: the functions inside a cell each have associated values that are decided during
training. If any function is found to be slowing the learning rate, the cell can learn to pass its
input through to the output without modification. Gates are present in each cell to decide
whether to let information pass through [9].
The sigmoid layer output lies between zero and one. Since it decides whether to keep or
discard information, it is also called the forget gate layer (a value of 1 means keep the
information, while 0 means forget it). The equation of the forget gate is given in Eq. 1 [10].
󰐾           (1)

The next step is to decide what new information is going to be stored in the cell, and this
step has two parts. First, a sigmoid layer called the “input gate layer” decides which values
will be updated. Next, a tanh layer creates a vector of new candidate values, C̃_t, that could
be added to the state. In the next step, these two are combined to create an update to the state.

󰐾           (2)

          (3)

The next step is to update the old cell state C_{t−1} into the new cell state C_t. We multiply
the old state by f_t and then add i_t ∗ C̃_t, the new candidate values scaled by how much we
decided to update each state value, as seen in Eq. 4 [10].

C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t                                                      (4)

Finally, we need to decide what is given as output. This output will be based on the cell
state, but will be a filtered version of it. First, we run a sigmoid layer which decides which
parts of the cell state to output. Then, we pass the cell state through tanh (to push the values to
be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output
the parts we decided to.

󰐾          (5)

󰐾     (6)

Fig. 2 shows the network architecture. The MFCCs are used as inputs to the layers of the
RNN LSTM network. LSTM 1 is the first layer and contains 128 neurons to learn long-term
dependencies. LSTM 2 is a hidden layer with the same LSTM architecture and 128 neurons;
it is added because LSTM 1 alone is not enough to provide the required result. After LSTM 2
there is a dense layer, a regular layer of neurons in a neural network in which each neuron
receives input from all the neurons in the previous layer. It is a fully connected layer with a
linear activation function.


Figure 2 Network architecture


In artificial neural networks, the activation function of a node defines the output of that node
given an input or set of inputs; a standard computer chip circuit can be seen as a digital network
of activation functions that are "ON" or "OFF" depending on the input. The dense layer passes
its values to the softmax layer. Softmax is implemented through a neural network layer just
before the output layer and must have the same number of nodes as the output layer. There are
two variants of the softmax layer: full softmax, which calculates a probability for every possible
class, and candidate sampling, in which softmax calculates a probability for all the positive
labels but only for a random sample of negative labels. The output is taken from the softmax
layer.
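
A minimal Keras sketch of this architecture is shown below. The paper specifies only the two 128-neuron LSTM layers, the dense layer with linear activation and the softmax output; the input shape (number of frames and MFCC coefficients) and the seven output classes are assumptions made for illustration.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Activation

n_frames, n_mfcc, n_classes = 100, 13, 7   # assumed input shape; 7 target words

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(n_frames, n_mfcc)),  # LSTM 1: 128 neurons
    LSTM(128),                                   # LSTM 2: hidden LSTM layer, 128 neurons
    Dense(n_classes, activation="linear"),       # fully connected layer, linear activation
    Activation("softmax"),                       # full softmax: one probability per word
])
model.summary()
```

In Keras, the gate equations of Eqs. (1)-(6) are handled internally by each LSTM layer, so only the layer sizes need to be specified.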

5. IMPLEMENTATION
The RNN network was implemented using TensorFlow in Python with the help of Keras. Keras
works as an interface over TensorFlow and reduces the amount of work the user has to do. We
used the Librosa library in Python to extract MFCC features from the WAV files. MFCC feature
extraction was done on these audio samples, and the coefficients were passed through the neural
network with the desired specifications, such as input size, output size and depth. We defined
how many times the coefficients were used to train the neural network (epochs). The batch size,
i.e., the number of audio files used to update parameters such as weights during training, was
also defined. After the defined number of epochs was completed, the trained model was created
and saved locally on the hard drive of the system. This model was imported for real-time voice
recognition.
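
The sketch below illustrates how such a training set could be assembled. The directory layout, sampling rate, 13 coefficients and 100-frame limit are assumptions made for illustration rather than details reported in the paper.

```python
import os
import numpy as np
import librosa

def extract_features(path, n_mfcc=13, max_frames=100):
    """Load a WAV file and return a fixed-size MFCC matrix (zero-padded or truncated)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]

# Hypothetical layout: data/<word>/<recording>.wav
words = sorted(os.listdir("data"))
X, y = [], []
for label, word in enumerate(words):
    for fname in os.listdir(os.path.join("data", word)):
        X.append(extract_features(os.path.join("data", word, fname)))
        y.append(label)

train_X = np.array(X)                 # shape (n_samples, max_frames, n_mfcc)
train_y = np.eye(len(words))[y]       # one-hot targets for the softmax output
```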
Speech input samples given to the system were saved in WAV format and passed for
pre-processing, i.e., MFCC feature extraction. The MFCCs were passed to the trained model
from the previous step. The trained model was loaded to predict the sample in the database
which most closely resembled the new input audio sample. We used a softmax layer as the last
layer. The output was predicted with the help of the model, and the corresponding text output
was obtained. This text output was passed to a text-to-speech system. The GTTS tool was used
to convert the text output to an audio output.
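
Putting these steps together, a minimal sketch of the recognition-plus-synthesis path is given below. The file names, the class ordering and the fixed frame count are assumptions, and extract_features is the hypothetical helper from the previous sketch.

```python
import numpy as np
from tensorflow.keras.models import load_model
from gtts import gTTS

# Class order must match the one used when building train_y; this ordering is an assumption.
words = ["chair", "food", "hello", "music", "sleep", "tablet", "water"]

model = load_model("speech_model.h5")            # hypothetical file name of the saved model

features = extract_features("input.wav")         # fixed-size MFCC matrix, as in training
probs = model.predict(features[np.newaxis, ...]) # softmax probabilities, shape (1, n_classes)
text = words[int(np.argmax(probs))]              # predicted word as text

gTTS(text=text, lang="en").save("output.mp3")    # synthesize the clean audio output
```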

5.1 Dataset
Data samples used for training were manually created with the help of a user who stutters.
Over 2500 voice samples were created. These samples were based on 7 simple words used in
daily life: food, water, sleep, chair, hello, tablet and music. These words were chosen because
they are likely to be commonly used by persons with disabilities. A total of
2972 voice recordings were created with these words. It took an average time of 1 hour to train
200 samples.

Table 1 Total number of samples recorded


Word     Male    Female
Hello    655     -
Water    654     -
Food     202     -
Chair    250     251
Music    250     250
Sleep    -       202
Table 1 gives the number of samples of each word that were recorded; e.g., 655 speech files
of ‘Hello’ were recorded for a male speaker. ‘Chair’ and ‘Music’ were recorded in both male
and female voices, with totals of 501 and 500 samples, respectively. ‘Sleep’ was recorded only
in the female voice.

5.2 Training
Training was done using a learning rate of 0.001, with a batch size of 32, for 500 epochs. An
epoch is one pass in which all of the training vectors are used once to update the weights.
Fig. 3 shows the schematic for training. In batch training, all the training samples pass
through the learning algorithm in one epoch before the weights are updated. The five
parameters used for training were: the training data (train_X), the target data (train_y), the
validation split, the number of epochs and the callbacks. The validation split was used to
randomly split the data into training data and validation data.
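
As a concrete illustration, the Keras call corresponding to these settings might look like the sketch below, using the model and arrays from the earlier sketches. The optimizer, the loss function and the specific callback are assumptions; the paper reports only the learning rate, batch size, epoch count, validation split and the use of callbacks.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer=Adam(learning_rate=0.001),   # learning rate from Section 5.2
              loss="mse",                            # assumed; a mean-squared-error validation loss is reported
              metrics=["accuracy"])

checkpoint = ModelCheckpoint("speech_model.h5", save_best_only=True)   # example callback

history = model.fit(train_X, train_y,
                    validation_split=0.2,            # split fraction is an assumption
                    epochs=500,
                    batch_size=32,
                    callbacks=[checkpoint])
```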

Figure 3 Schematic for training


During training, we were able to monitor the validation loss, which gives the mean squared
error of our model on the validation set.

Figure 4 Training process


Fig. 4 shows a screenshot of the training process, which took an average of 7 hours. The
training was done on 1493 samples, as seen in the figure.

5.3 User Interface


A simple user interface was designed to help persons with speech disorders interact easily
with others. Fig. 5 (a) shows the user interface, which has a listen button. When a person with
a speech disorder clicks the listen button and says a word, the input is recognized by the trained
model and converted to text. A recognizer box, shown in Fig. 5 (b), displays the recognized
word as the output of the program.

Figure 5 (a) Listen box of the user interface (b) Text output obtained in the user interface
After the correct word has been predicted, the text output is converted into an audio file
using GTTS. This audio is the output that other people hear, so that the user can easily
communicate in the real world.
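
The paper does not state which GUI toolkit was used; the Tkinter sketch below shows one possible way to realize a Listen button and a recognizer box, with recognize_word() as a hypothetical helper that wraps the recording and recognition pipeline described above.

```python
import tkinter as tk

def on_listen():
    # Record a word, run the trained model and display the recognized text.
    word = recognize_word()          # hypothetical helper wrapping the recognition pipeline
    recognizer_box.config(text=word)

root = tk.Tk()
root.title("Speech Reconstruction")
tk.Button(root, text="Listen", command=on_listen).pack(padx=20, pady=10)
recognizer_box = tk.Label(root, text="")     # recognizer box showing the text output
recognizer_box.pack(padx=20, pady=10)
root.mainloop()
```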

6. RESULTS AND DISCUSSION


We first used an LSTM network with a smaller cell size (around 30) and found that the success
rate was low, since the number of cells was not enough to capture the information contained in
the input. When we increased the size, the training time increased considerably, but the success
rate improved. To further improve performance, we added another hidden LSTM layer with the
same number of cells as the first one, which increased the amount of information captured.


Table 2 Output success rate


Word     No. of hits    Total no. of trials    Success Rate (%)
Hello    12             22                     54.5
Water    10             20                     50.0
Food     10             20                     50.0
Chair    16             20                     80.0
Music    8              20                     40.0
Sleep    6              20                     30.0
Average success rate                           50.75
Table 2 shows the output success rate when each word was tested. For example, the word
‘Hello’ was spoken a total of 22 times during testing, and the output was correct 12 times, so
the success rate for ‘Hello’ is 12/22, or 54.5%. The average success rate is 50.75%.
The low success rate may be because the background noise during training and during testing
was not similar. By increasing the quality and quantity of the training samples, more accurate
results are expected.
Fig. 6 depicts the test and training loss, plotted against the number of epochs. The number of
epochs is the number of times all the training vectors are used once to update the weights.
Increasing the number of epochs can improve performance, but it can also cause overfitting,
in which the network learns even minor details, such as noise and faint sounds, present in the
training samples.

Figure 6 Loss graph


Finding the number of epochs needed for training is also a major step in creating an
effective model. The inadequate number of data samples led to increased test losses; a
database with enough samples would help in the successful implementation of this system.
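
One common way to pick the number of epochs automatically, rather than fixing it at 500, is an early-stopping callback. The sketch below continues the training example of Section 5.2 and is an illustration of that idea, not the procedure reported in the paper.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss stops improving, instead of always running 500 epochs.
early_stop = EarlyStopping(monitor="val_loss", patience=20, restore_best_weights=True)

history = model.fit(train_X, train_y,
                    validation_split=0.2,
                    epochs=500,              # upper bound; training may stop earlier
                    batch_size=32,
                    callbacks=[early_stop])
```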

7. CONCLUSION
The system aims to recognize partially pronounced words. A recurrent neural network is
used for word recognition: RNN LSTM networks detect the partially spoken words, and the
obtained text is converted to an audio signal. The MFCC technique is used for feature
extraction. In future work, the system could be improved to produce speech output matched to
the user’s pitch, age and dialect.


REFERENCES
[1] Mohamad A. A. Al-Rababah, Abdusamad Al-Marghilani, Akram Aref Hamarshi,
“Automatic Detection Technique for Speech Recognition based on Neural Networks”,
International Journal of Advanced Computer Science and Applications, Volume 9, No. 3,
2018, pp. 215-231.
[2] Wouter Gevaert, Georgi Tsenov, Valeri Mladenov, “Neural networks used for speech
recognition”, Journal of Automatic Control, January 2010, pp. 2120-2202.
[3] Y. Yu, "Research on Speech Recognition Technology and Its Application," Proceedings of
International Conference on Computer Science and Electronics Engineering, 2012, pp. 306-
309.
[4] M. Tu and X. Zhang, "Speech enhancement based on Deep Neural Networks with skip
connections," IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2017, pp. 5565-5569, doi: 10.1109/ICASSP.2017.7953221.
[5] Khaled Necibi, Halima Bahi and Toufik Sari, "Speech Disorders Recognition using Speech
Analysis", 2014, pp. 1-14. DOI: 10.4018/978-1-4666-4422-9.ch024.
[6] J. Meng, J. Zhang and H. Zhao, "Overview of the Speech Recognition Technology,"
Proceedings of International Conference on Computational and Information Sciences, 2012,
pp. 199-202. doi: 10.1109/ICCIS.2012.202.
[7] A.K. Jain, and J. Mao, "Neural Networks and Pattern Recognition," Computational
Intelligence, IEEE Press, 1994, pp. 194-212.
[8] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, "Gated feedback
recurrent neural networks", Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 2015,
pp. 2067–2075.
[9] R. L. K. Venkateswarlu, Dr. R. Vasantha Kumari, G. Vani JayaSri, “Speech Recognition
by Using Recurrent Neural Networks”, International Journal of Scientific & Engineering
Research, Volume 2, Issue 6, 2011, pp. 1-7.
[10] A. Graves, A. Mohamed and G. Hinton, "Speech recognition with deep recurrent neural
networks," 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, 2013, pp. 6645-6649, doi: 10.1109/ICASSP.2013.6638947.
[11] Sayf Majeed, Hafizah Husain, Salina Samad and Tarik Idbeaa, “Mel frequency cepstral
coefficients (Mfcc) feature extraction enhancement in the application of speech recognition:
A comparison study”, Journal of Theoretical and Applied Information Technology, 2015,
vol. 79, pp. 38-56.
[12] Google Text-to-Speech, https://gtts.readthedocs.io/en/latest/
[13] M. A. Hossan, S. Memon and M. A. Gregory, "A novel approach for MFCC feature
extraction," Proceedings of International Conference on Signal Processing and
Communication Systems, 2010, pp. 1-5.
[14] P. Kannan, S. K. Udayakumar and K. R. Ahmed, "Automation using voice recognition with
python," Proceedings of International Conference on Industrial Automation, Information
and Communications Technology, 2014, pp. 1-4.
[15] Giovanni Dimauro, Vincenzo Di Nicola, Vitoantonio Bevilacqua, Danilo Caivano,
Francesco Girardi, “Assessment of Speech Intelligibility in Parkinson’s Disease Using a
Speech-To-Text System”, IEEE Access, vol. 5, 2017, pp. 22199-22208.
