International Journal of Electrical Engineering and Technology (IJEET)
Volume 12, Issue 3, March 2021, pp. 269-277, Article ID: IJEET_12_03_035
Available online at http://iaeme.com/Home/issue/IJEET?Volume=12&Issue=3
ISSN Print: 0976-6545 and ISSN Online: 0976-6553
DOI: 10.34218/IJEET.12.3.2021.035
SPEECH RECONSTRUCTION USING LSTM NETWORKS
Lani Rachel Mathew, Arun Manohar, Nidheesh S, S. Sainath, Arsha Vijayan and K. Gopakumar
Department of Electronics & Communication,
TKM College of Engineering, Kerala, India
ABSTRACT
This paper describes a real-time system that reconstructs the partially formed
words of persons with speaking disabilities. Recurrent neural networks (RNNs) have
been used extensively in the conversion and reconstruction of partially spoken words.
However, traditional RNNs suffer from exploding and vanishing gradient problems,
which slow learning and degrade the overall performance of the system. To avoid this,
we use LSTM (long short-term memory) networks. The obtained text message is
converted into an audio signal. The MFCC (mel frequency cepstral coefficient)
technique is used for feature extraction. An LSTM cell offers more control over its
output than a standard RNN cell, which has only one function controlling the output.
We aim to create a system that can reconstruct words spoken by people with disorders
such as stuttering. Results indicate that RNN-LSTM networks offer a promising
solution, provided the system model is trained well.
Keywords: recurrent neural network; mel frequency cepstral coefficient; speech;
assistive; long short term memory
Cite this Article: Lani Rachel Mathew, Arun Manohar, Nidheesh S, S. Sainath, Arsha
Vijayan and K. Gopakumar, Speech reconstruction using LSTM Networks,
International Journal of Electrical Engineering and Technology (IJEET), 12(3), 2021,
pp.269-277.
http://iaeme.com/Home/issue/IJEET?Volume=12&Issue=3
1. INTRODUCTION
Various speech assistive systems have been developed over the past decades to help people
facing difficulties in speaking. Machine learning techniques and deep learning are opening
different avenues in speech recognition [1,2,3], speech enhancement and speech analysis [4,5]
today. Speech recognition relies heavily on principles derived from pattern recognition and
neural network implementations [6,7].
For recognizing speech, the temporal nature of speech signals is better exploited with the
use of RNNs [8,9]. However, traditional RNNs faced the issue of exploding and vanishing
gradient problems during training, and hence various modifications were investigated. One
promising area of research is the use of LSTM networks [10]. RNNs augmented with memory
units are known as LSTM networks; they consist of cells, input gates, output gates and forget
gates. Cells remember values obtained from the network over arbitrary time intervals, and the
three gates regulate the flow of information to and from the cell. LSTM networks are very
useful for classifying, processing and making predictions based on time series data.
This paper describes the use of an LSTM-based RNN system that converts disordered
speech (mainly speech with stuttering) to normal-sounding speech.
The paper is organized as follows. Section 2 describes the system overview. Feature
extraction is discussed in Section 3. Section 4 presents the network architecture. The
implementation details are given in Section 5. Section 6 describes the results obtained,
followed by the conclusions in Section 7.
3. FEATURE EXTRACTION
The mel frequency cepstral coefficient (MFCC) is the most commonly used feature for
speech recognition applications [11, 13]. Speech is probably the most crucial tool for
communication in our daily lives, so constructing a speech recognition and reconstruction
system is always desirable. At its core, speech recognition is the process of converting an
acoustic signal into a sequence of words.
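As a concrete illustration, the sketch below extracts MFCCs from a WAV file with the Librosa library used later in Section 5; the file name, sampling-rate handling and number of coefficients are illustrative assumptions rather than values taken from the paper.

```python
# Minimal MFCC extraction sketch (assumed parameters, not the paper's exact setup).
import librosa

def extract_mfcc(wav_path, n_mfcc=13):
    # Load the recording at its native sampling rate.
    signal, sr = librosa.load(wav_path, sr=None)
    # Compute the MFCC matrix: shape (n_mfcc, number of frames).
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)

features = extract_mfcc("sample.wav")   # hypothetical file name
print(features.shape)
```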
4. NETWORK ARCHITECTURE
A recurrent neural network (RNN) exhibits temporal dynamic behaviour over a time sequence.
An RNN differs from other networks in its ability to store information, aided by an internal
memory. This memory helps the network perform tasks such as language translation,
unsegmented connected handwriting recognition and speech recognition. The main problem
with a plain RNN is that it cannot handle long-term dependencies, so an LSTM is used instead.
Like all RNNs, an LSTM consists of repeating cells that perform the same operations, with the
cell state carrying the output. The functions inside a cell, and the values associated with each
of them, are learned during training; if any function is found to slow the learning rate, the cell
can decide to pass its input through unchanged as its output. Gates are present in each cell to
decide whether to let information pass through [9].
The sigmoid layer outputs a value between zero and one. Since it decides whether to keep
the previous information, it is called the forget gate layer (a value of 1 means keep the input
and 0 means forget it). The forget gate is given by Eq. (1) [10].

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)   (1)
The next step is to decide what new information will be stored in the cell, and this step has
two parts. First, a sigmoid layer called the "input gate layer" decides which values are to be
updated, as in Eq. (2). Next, a tanh layer creates a vector of new candidate values \tilde{C}_t
that could be added to the state, as in Eq. (3). These two are then combined to create an
update to the state.

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)   (2)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)   (3)

The old cell state C_{t-1} is then updated into the new cell state C_t. We multiply the old
state by f_t and then add i_t \tilde{C}_t, the new candidate values scaled by how much we
decided to update each state value, as seen in Eq. (4) [10].

C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   (4)
Finally, we need to decide what is given as output. This output is based on the cell state,
but is a filtered version of it. First, we run a sigmoid layer that decides which parts of the cell
state to output, as in Eq. (5). Then, we pass the cell state through tanh (to push the values to
between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output
the parts we decided to, as in Eq. (6).

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)   (5)

h_t = o_t \odot \tanh(C_t)   (6)
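To make the gate equations concrete, the following sketch steps a single LSTM cell forward in NumPy; the weight shapes, random initial values and dimensions are purely illustrative and are not taken from the paper's trained network.

```python
# One forward step of an LSTM cell, mirroring Eqs. (1)-(6); all values are toy examples.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # Eq. (1): forget gate
    i_t = sigmoid(W_i @ z + b_i)             # Eq. (2): input gate
    c_tilde = np.tanh(W_C @ z + b_C)         # Eq. (3): candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # Eq. (4): cell state update
    o_t = sigmoid(W_o @ z + b_o)             # Eq. (5): output gate
    h_t = o_t * np.tanh(c_t)                 # Eq. (6): hidden state output
    return h_t, c_t

# Toy dimensions: 13 MFCC inputs, 4 hidden units (assumed for illustration only).
rng = np.random.default_rng(0)
n_in, n_hid = 13, 4
W = [rng.standard_normal((n_hid, n_hid + n_in)) * 0.1 for _ in range(4)]
b = [np.zeros(n_hid) for _ in range(4)]
h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), *W, *b)
```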
Fig. 2 shows the network architecture. The MFCCs are used as inputs to the layers of the
RNN-LSTM network. LSTM 1, the first layer, contains 128 neurons to learn long-term
dependencies. LSTM 2 is a hidden layer with the same LSTM architecture and 128 neurons;
it is added because LSTM 1 alone is not enough to provide the required result. After LSTM 2
there is a dense layer, a regular layer of neurons in which each neuron receives input from all
the neurons in the previous layer. It is a fully connected layer with a linear activation function.
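A minimal Keras sketch of this architecture is given below; the layer sizes (128, 128) and the 7-word softmax output follow the description in this paper, while the input shape, dense-layer width and optimizer settings are assumptions for illustration.

```python
# Sketch of the RNN-LSTM architecture described above (assumed input shape and dense width).
import tensorflow as tf
from tensorflow.keras import layers, models

n_mfcc, n_frames, n_words = 13, 100, 7   # illustrative feature dimensions; 7 target words

model = models.Sequential([
    # LSTM 1: first layer, 128 neurons; returns the full sequence for the next LSTM layer.
    layers.LSTM(128, return_sequences=True, input_shape=(n_frames, n_mfcc)),
    # LSTM 2: hidden LSTM layer with 128 neurons.
    layers.LSTM(128),
    # Dense layer: fully connected, linear activation, as described above.
    layers.Dense(64, activation="linear"),
    # Softmax output over the 7 target words (used for prediction in Section 5).
    layers.Dense(n_words, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
```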
5. IMPLEMENTATION
The RNN network was implemented using TensorFlow in Python with the help of Keras.
Keras works as an interface over TensorFlow and reduces the amount of low-level code the
user has to write. We used the Librosa library in Python to extract MFCC features from the
WAV files.
MFCC feature extraction was done on these audio samples, and the coefficients were passed
through the neural network having desired specifications such as input size, output size and
depth. We defined how many times the coefficients were used to train the neural network
(epochs). The batch size, i.e., the number of audio files used to change the parameters like
weight during training was also defined. After the defined number of epochs was completed,
the trained model was created and saved locally on the hard drive of the system. This model
was imported for real-time voice recognition.
Speech input samples given to the system were saved in WAV format and passed for pre-
processing, i.e., MFCC feature extraction. The MFCCs were passed to the trained model from
the previous step. The trained model was loaded to predict which sample in the database most
closely resembled the new input audio sample. We used a softmax layer as the last layer. The
output was predicted with the help of the model and the corresponding text output was
obtained. This text output was passed to a text-to-speech system; the gTTS tool [12] was used
to convert the text output to an audio output.
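The runtime flow just described might look roughly as follows; the model file name, label-list order and fixed frame count are assumptions made for this sketch.

```python
# Sketch of the recognition-and-playback step: load the saved model, extract MFCCs,
# predict the word and synthesise it with gTTS. File names and shapes are assumptions.
import numpy as np
import librosa
import tensorflow as tf
from gtts import gTTS

WORDS = ["food", "water", "sleep", "chair", "hello", "tablet", "music"]
N_FRAMES = 100                                           # assumed fixed MFCC frame count

model = tf.keras.models.load_model("stutter_lstm.h5")    # hypothetical model file

signal, sr = librosa.load("input.wav", sr=None)          # hypothetical input recording
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
# Pad or truncate to the frame count the network was trained on.
mfcc = np.pad(mfcc, ((0, 0), (0, max(0, N_FRAMES - mfcc.shape[1]))))[:, :N_FRAMES]

probs = model.predict(mfcc.T[np.newaxis, ...])           # batch of one (frames, n_mfcc) input
word = WORDS[int(np.argmax(probs))]                      # predicted text output

gTTS(text=word, lang="en").save("output.mp3")            # text-to-speech audio output
```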
5.1 Dataset
Data samples used for training were created manually with the help of a user who stutters.
Over 2500 voice samples were created, based on 7 simple words used in daily life: food, water,
sleep, chair, hello, tablet and music. These words were chosen because they are likely to be
commonly used by persons with disabilities. A total of 2972 voice recordings were created
with these words. Training took an average of 1 hour per 200 samples.
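One way the recordings could be assembled into training arrays is sketched below; the folder layout (one sub-folder per word), the fixed frame count and the names train_X and train_y are assumptions, not details given in the paper.

```python
# Sketch of building (train_X, train_y) from per-word folders of WAV recordings (assumed layout).
import os
import numpy as np
import librosa

WORDS = ["food", "water", "sleep", "chair", "hello", "tablet", "music"]
N_MFCC, N_FRAMES = 13, 100

def wav_to_mfcc(path):
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=N_MFCC)
    # Pad or truncate every recording to a fixed number of frames.
    mfcc = np.pad(mfcc, ((0, 0), (0, max(0, N_FRAMES - mfcc.shape[1]))))
    return mfcc[:, :N_FRAMES].T                   # shape (frames, n_mfcc)

X, y = [], []
for label, word in enumerate(WORDS):
    folder = os.path.join("dataset", word)        # hypothetical folder per word
    for fname in os.listdir(folder):
        X.append(wav_to_mfcc(os.path.join(folder, fname)))
        y.append(np.eye(len(WORDS))[label])       # one-hot target for this word
train_X, train_y = np.stack(X), np.stack(y)
```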
5.2 Training
Training was done using a learning rate of 0.001, with a batch size of 32, for 500 epochs. An
epoch is one pass in which all of the training vectors are used once to update the weights.
Fig. 3 shows the schematic for training. For batch training, all the training samples pass
through the learning algorithm simultaneously in one epoch before weights are updated. The
five parameters used for training were: the training data (train_X), target data (train_y),
validation split, the number of epochs and callbacks. The validation split was used to randomly
split the data into training data and testing data.
During training, we were able to monitor the validation loss, which gives the mean squared
error of our model on the validation set.
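Continuing the model sketch from Section 4, the training call with these five parameters might look as follows; the validation-split ratio, checkpoint callback and file name are assumptions, while the epoch count and batch size are those stated above.

```python
# Sketch of the training step; assumes `model`, `train_X` and `train_y` from the earlier sketches.
import tensorflow as tf

callbacks = [tf.keras.callbacks.ModelCheckpoint("stutter_lstm.h5",   # hypothetical file name
                                                save_best_only=True)]

history = model.fit(train_X, train_y,
                    validation_split=0.2,   # assumed ratio for the train/validation split
                    epochs=500,             # number of passes over the training data
                    batch_size=32,          # samples used per weight update
                    callbacks=callbacks)

print(history.history["val_loss"])          # validation loss reported per epoch
```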
6. RESULTS
Figure 5 (a) Listen box of the user interface (b) Text output obtained in the user interface
After the correct word is predicted, the text output is converted into an audio file using
gTTS. This audio output is what other people hear, so that the user can easily communicate
in the real world.
7. CONCLUSION
The system attempts to recognize partially pronounced words. A recurrent neural network is
used for recognition of the word: RNN-LSTM networks detect the partially spoken words, and
the obtained text is converted into an audio signal. The MFCC technique is used for feature
extraction. In the future, this system can be improved so that it produces speech output
matching the user's pitch, age and dialect.
REFERENCES
[1] Mohamad A. A. Al-Rababah, Abdusamad Al-Marghilani, Akram Aref Hamarshi,
"Automatic Detection Technique for Speech Recognition based on Neural Networks",
International Journal of Advanced Computer Science and Applications, Volume 9, No. 3,
2018, pp. 215-231.
[2] Wouter Gevaert, Georgi Tsenov, Valeri Mladenov, "Neural networks used for speech
recognition", Journal of Automatic Control, January 2010, pp. 2120-2202.
[3] Y. Yu, "Research on Speech Recognition Technology and Its Application," Proceedings of
International Conference on Computer Science and Electronics Engineering, 2012, pp. 306-
309.
[4] M. Tu and X. Zhang, "Speech enhancement based on Deep Neural Networks with skip
connections," IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2017, pp. 5565-5569, doi: 10.1109/ICASSP.2017.7953221.
[5] Khaled Necibi, Halima Bahi and Toufik Sari, "Speech Disorders Recognition using Speech
Analysis", 2014, pp. 1-14. DOI: 10.4018/978-1-4666-4422-9.ch024.
[6] J. Meng, J. Zhang and H. Zhao, "Overview of the Speech Recognition Technology,"
Proceedings of International Conference on Computational and Information Sciences, 2012,
pp. 199-202. doi: 10.1109/ICCIS.2012.202.
[7] A.K. Jain, and J. Mao, "Neural Networks and Pattern Recognition," Computational
Intelligence, IEEE Press, 1994, pp. 194-212.
[8] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, "Gated feedback
recurrent neural networks", Proceedings of the 32nd International Conference on
International Conference on Machine Learning - Volume 37 (ICML'15). JMLR.org, 2015,
pp. 2067–2075.
[9] R. L. K. Venkateswarlu, Dr. R. Vasantha Kumari, G. Vani JayaSri, “Speech Recognition
by Using Recurrent Neural Networks”, International Journal of Scientific & Engineering
Research, Volume 2, Issue 6, 2011, pp. 1-7.
[10] A. Graves, A. Mohamed and G. Hinton, "Speech recognition with deep recurrent neural
networks," 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing, 2013, pp. 6645-6649, doi: 10.1109/ICASSP.2013.6638947.
[11] Sayf Majeed, Hafizah Husain, Salina Samad and Tarik Idbeaa, "Mel frequency cepstral
coefficients (MFCC) feature extraction enhancement in the application of speech recognition:
A comparison study", Journal of Theoretical and Applied Information Technology, 2015,
vol. 79, pp. 38-56.
[12] Google Text-to-Speech, https://gtts.readthedocs.io/en/latest/
[13] M. A. Hossan, S. Memon and M. A. Gregory, "A novel approach for MFCC feature
extraction," Proceedings of International Conference on Signal Processing and
Communication Systems, 2010, pp. 1-5.
[14] P. Kannan, S. K. Udayakumar and K. R. Ahmed, "Automation using voice recognition with
python," Proceedings of International Conference on Industrial Automation, Information
and Communications Technology, 2014, pp. 1-4.
[15] Giovanni Dimauro, Vincenzo Di Nicola, Vitoantonio Bevilacqua, Danilo Caivano,
Francesco Girardi, "Assessment of Speech Intelligibility in Parkinson's Disease Using a
Speech-To-Text System", IEEE Access, vol. 5, 2017, pp. 22199-22208.