Speech Recognition Using Neural Networks
1. INTRODUCTION
Speech can be a useful interface for interacting with machines. Research aimed at improving this type of
communication has been carried out for a long time, and with the growth of computational power it has become
possible to build systems capable of real-time conversion. Despite the good progress made in this field, speech
recognition still faces many problems. These problems are due to variations among speakers, including variations
caused by age, sex, speaking rate and the emotional condition of the speaker, all of which lead to differences in
the pronunciation of different persons. The surroundings can add noise to the signal, and sometimes the speaker
introduces noise as well. In the speech recognition process, an acoustic signal captured by a microphone or
telephone is converted into a set of characters. Automatic speech recognition (ASR) is viewed here as an integral
part of future human-computer interfaces; hence, humans could use speech as a useful interface for interacting
with machines. Humans have always wanted computing to be natural, pervasive and simultaneous. Elham S. Salam
compared the effect of visual features on the performance of a speech recognition system for people with speech
disorders against an audio-only speech recognition system; different methods of visual feature selection were
compared and isolated English words were recognized. The recognition of a simple alphabet may seem a simple task
for human beings, but because of problems such as high acoustic similarity among certain groups of letters, speech
recognition can be a challenging task. The use of conventional Multi-Layer Perceptron neural networks is
increasing day by day, and such networks have performed well as effective classifiers for vowel sounds with
stationary spectra. Feed-forward multi-layer neural networks, however, are not able to deal with time-varying
information such as the time-varying spectra of speech sounds. This problem can be coped with by incorporating a
feedback structure into the network.
Speech recognition systems can be classified into several different types according to the type of speech
utterance, the type of speaker model and the type of vocabulary that they have the ability to recognize. These
categories are briefly explained below:
A. Types of speech utterance:
Speech recognition systems are classified according to the type of utterance they are able to recognize. They are
classified as:
1) Isolated word: An isolated-word recognizer usually requires each spoken word to have quiet (lack of an audio
signal) on both sides of the sample window. It accepts a single word at a time.
2) Connected word: Similar to isolated-word recognition, but it allows separate utterances to be run together with
only a minimum pause between them.
3) Continuous speech: It allows users to speak naturally while the computer determines the content in parallel.
4) Spontaneous Speech: It is the type of speech which is natural sounding and is not rehearsed.
B. Types of speaker model:
Speech recognition systems fall broadly into two main categories based on speaker models, namely speaker
dependent and speaker independent.
1) Speaker dependent models: These systems are designed for a specific speaker. They are easier to develop and
more accurate, but they are not very flexible.
2) Speaker independent models: These systems are designed for a variety of speakers. They are more difficult to
develop and less accurate, but they are much more flexible.
C. Types of vocabulary:
The vocabulary size of a speech recognition system affects its processing requirements, accuracy and complexity.
In a speech-to-text voice recognition system, the types of vocabularies can be classified as follows:
1) Small vocabulary: single letters.
2) Medium vocabulary: two- or three-letter words.
3) Large vocabulary: words with more letters.
1.1 Problem Definition:
The main objective of this project is to recognize speech and convert it into text using a neural network.
Further aims are:
To reduce the work of human beings.
To save time during any work in an emergency.
To be useful for deaf and dumb people.
To create a tool that helps people learn to speak English correctly in an effective way.
To engage people with the new technology.
1.2 Project overview
The architectural diagram of a typical voice and speaker recognition system is shown in
Figure 1. The system is trained to recognize the voice of individual speakers with each speaker
providing specific sets of utterances through a microphone terminal or telephone. The captured
analog voice waveform has three components: speech segment, silence or non-voiced segment,
and background noise signals. To extract the relevant speech signals, the voice waveform is
digitized and signal processing is carried out to remove the noise signals and the silence or non-
voiced components. Any relevant information that is discarded during this processing is lost; conversely, any
irrelevant information that is allowed to pass, such as the fundamental frequency of the speaker or the
characteristics of the microphone, is treated as useful, with implications for the speech-feature classification
performance. The extracted speech signals are then converted into streams of template feature vectors of the
voice pattern for classification and training. If irrelevant information is allowed through, the speech features
generated from the corrupted speech signals may no longer be similar to the class distributions learned from the
training data. The system recognizes the voice of individual speakers by comparing the extracted speech features
of their utterances with the respective template features obtained from the training stage. The GMM recognizer
computes scores that are used for matching the most distinctive speech features of speakers. The decision
criteria for the voice recognition of speakers are based on correlation analysis of the speech features obtained
from the CNN and MFCC.
The speaker whose voice characteristics best match the stored voice is identified, and a speaker whose voice
characteristics are not matched is eligible for a new entry in the database (Fig).
2. LITERATURE SURVEY
2.1 Existing Systems:
Various types of voice and speaker recognition techniques are available. In this section, we
provide the literature review of work done in this field.
Esfandier Zavarehei et al. (2005) introduced a time-frequency estimator for the enhancement of noisy speech
signals in the DFT domain. It is based on a low-order autoregressive process used to model the time-varying
trajectory of the DFT components of speech, which is embedded in the Kalman filter state equation. A method was
devised to restart the Kalman filter at the onsets of speech. The performance of this method was compared with
parametric spectral subtraction and the MMSE estimator for the enhancement of noisy speech. The result of the
proposed method is that residual noise is reduced and the quality of speech is improved using Kalman filters [2].
Puneet Kaur, Bhupender Singh and Neha Kapur (2012) discussed how to use the Hidden Markov Model in the process of
speech recognition. To develop an ASR (Automatic Speech Recognition) system, the three essential steps are
pre-processing, feature extraction and recognition, and finally the hidden Markov model is used to obtain the
desired result. Researchers are continuously trying to develop a perfect ASR system; although there have already
been huge advancements in the field of digital signal processing, the performance of computers in this field is
still not high enough in terms of speed of response and matching accuracy. The three different techniques used by
researchers are the acoustic-phonetic approach, the pattern recognition approach and the knowledge-based
approach [4].
Ibrahim Patel et al. (2010) presented an approach to speech recognition in which frequency spectral information
is combined with the conventional Mel spectrum to improve HMM-based recognition. The Mel-frequency approach
exploits frequency observations of speech within a given resolution, which results in overlapping resolution
features and thereby limits recognition. In the HMM-based speech recognition system, resolution decomposition
with a separating-frequency mapping approach is used. The result of the study is an improvement in the quality
metrics of speech recognition with respect to computational time and learning accuracy in the speech recognition
system [6].
Hidden Markov Model:
Consider a system which may be described at any time as being in one of a set of N distinct states,
S1, S2, S3, ..., SN. At regular time intervals the system undergoes a change of state (possibly back to the same
state) according to a set of probabilities associated with the state. We denote the time instants associated with
state changes as t = 1, 2, ..., and we denote the actual state at time t as qt. A full probabilistic description
of the above system would, in general, require specification of the current state as well as all the predecessor
states. For the special case of a discrete, first-order Markov chain, this probabilistic description is truncated
to just the current and the predecessor state, i.e.
P[qt = Sj | qt−1 = Si, qt−2 = Sk, ...] = P[qt = Sj | qt−1 = Si]
Furthermore, we consider only those processes in which the right-hand side of the above equation is independent
of time, thereby leading to the set of state transition probabilities aij of the form
aij = P[qt = Sj | qt−1 = Si],  1 ≤ i, j ≤ N
with the state transition coefficients having the properties aij ≥ 0 and Σ (j = 1 to N) aij = 1.
The above process could be called an observable Markov model, since the output of the process is the set of
states at each instant of time, where each state corresponds to an observable event.
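As a minimal illustration of these two properties, the following Python sketch builds a small transition matrix
with made-up probabilities (they are assumptions, not values from this project), checks that every aij is
non-negative and every row sums to one, and samples a state sequence from the resulting observable Markov model.

import numpy as np

# Hypothetical 3-state transition matrix A, where A[i, j] = P(q_t = Sj | q_t-1 = Si).
# The probabilities are illustrative only; a real model would estimate them from data.
A = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

# The two properties from the text: aij >= 0 and each row sums to 1.
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)

# Sample a state sequence q1, q2, ... from this observable Markov model.
rng = np.random.default_rng(0)
state = 0                              # start in state S1
sequence = [state]
for t in range(10):
    state = rng.choice(3, p=A[state])  # next state drawn from row `state` of A
    sequence.append(state)
print("state sequence:", sequence)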
In spectral analysis, the waveform of the speech signal is analyzed through its spectral representation. Apart
from this, another tool that needs to be studied is the Mel Frequency Cepstrum Coefficients (MFCC). The Mel
frequency cepstrum coefficients are preferred for extracting the features of the speech signal: they transform
the speech signal into the frequency domain, and hence the training vectors are generated from them. Another
reason for using this method is that human hearing is based on frequency analysis. Before obtaining the MFCC of a
speech signal, pre-emphasis filtering is applied to the signal with a first-order finite impulse response (FIR)
filter.
This figure shows the general procedure of the speech recognition process.
Testing is the process in which different speech signals are tested using a special type of neural network. This
is the main step in the speech recognition process. Testing of the speech signals is done after training.
Pre-emphasis refers to a system process designed to increase, within a band of frequencies, the magnitude of some
(usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies in order to
improve the overall SNR. Hence, this step passes the signal through a filter which emphasizes higher frequencies,
increasing the energy of the signal at higher frequencies.
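As a minimal sketch of this step, the first-order FIR pre-emphasis filter y[n] = x[n] − a·x[n−1] can be applied
as follows; the coefficient a = 0.97 is a common choice and an assumption here, since the report does not state
the exact value used.

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before feature extraction.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])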
Framing:
The speech samples obtained from an ADC are segmented into small frames with a length in the range of 20 to
40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples
(M < N). Typical values used are M = 100 and N = 256.
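A simple way to perform this framing, assuming the signal is already a NumPy array and using the N = 256,
M = 100 values quoted above, is sketched below.

import numpy as np

def frame_signal(signal, frame_length=256, frame_step=100):
    # Split a 1-D signal into overlapping frames of N samples shifted by M samples.
    num_frames = 1 + max(0, (len(signal) - frame_length) // frame_step)
    return np.stack([
        signal[i * frame_step : i * frame_step + frame_length]
        for i in range(num_frames)
    ])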
Hamming windowing:
The Hamming window is used as the window shape, taking into account the next block in the feature-extraction
processing chain and integrating all the closest frequency lines. The Hamming window and its application to each
frame are represented as shown below.
If the window is defined as W (n), 0 ≤ n ≤ N-1 where
N = number of samples in each frame
Y[n] = Output signal
X (n) = input signal
W (n) = Hamming window, then the result of windowing signal is shown below:
Y[n] =X (n)*W (n)
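A brief sketch of this windowing step is shown below; the Hamming coefficients follow the textbook definition
W(n) = 0.54 − 0.46·cos(2πn/(N−1)), which is assumed here since the section does not spell the formula out (it is
also available directly as numpy.hamming).

import numpy as np

N = 256                          # samples per frame, as in the framing step
n = np.arange(N)
W = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window, same as np.hamming(N)

def apply_window(frame, window=W):
    # Y[n] = X(n) * W(n): element-wise product of the frame and the window.
    return frame * window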
To convert each frame of N samples from the time domain into the frequency domain, the FFT is used. The Fourier
transform converts the convolution of the glottal pulse U[n] and the vocal-tract impulse response H[n] in the
time domain into a multiplication in the frequency domain, as shown in the equation below:
Y(w) = FFT[h(t) * X(t)] = H(w) · X(w)
where X(w), H(w) and Y(w) are the Fourier transforms of X(t), H(t) and Y(t), respectively.
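A minimal sketch of this step, assuming each windowed frame is a NumPy array and using an FFT length of 512 (the
length mentioned later in this report; the 400-point FFT of Fig. 5 would simply use n_fft=400):

import numpy as np

def frame_spectrum(windowed_frame, n_fft=512):
    # Magnitude spectrum of one windowed frame; rfft keeps the non-negative frequencies.
    return np.abs(np.fft.rfft(windowed_frame, n=n_fft))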
Energy = Σt X²[t]
where X[t] is the signal.
Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy
feature, while each of the 13 double-delta features represents the change between frames in the corresponding
delta features, giving 39 features per frame in total.
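These delta and double-delta features can be computed, for example, with librosa (an assumption; any equivalent
differencing routine would do). The mfcc array below is a placeholder standing in for the coefficients extracted
earlier.

import numpy as np
import librosa

mfcc = np.random.randn(13, 100)                # placeholder: 13 coefficients x 100 frames
delta = librosa.feature.delta(mfcc)            # first-order (delta) features
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (double-delta) features
features = np.vstack([mfcc, delta, delta2])    # shape (39, 100): full feature vector per frame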
The static, delta and delta-delta features thus play roles analogous to the red, green and blue channels of an
image, although, as described below, there is more than one alternative for how precisely to bundle these into
feature maps.
In keeping with this metaphor, we need to use inputs that preserve locality in both axes of frequency and
time. Time presents no immediate problem from the standpoint of locality. Like other DNNs for speech, a
single window of input to the CNN will consist of a wide amount of context (9–15 frames). As for
frequency, the conventional use of MFCCs does present a major problem because the discrete cosine
transform projects the spectral energies into a new basis that may not maintain locality. In this paper, we
shall use the log-energy computed directly from the mel-frequency spectral coefficients (i.e., with no
DCT), which we will denote as MFSC features. These will be used to represent each speech frame, along
with their deltas and delta-deltas, in order to describe the acoustic energy distribution in each of several
different frequency bands.
Fig. 1. Two different ways can be used to organize speech input features to a CNN. The above example assumes 40
MFSC features plus first and second derivatives, with a context window of 15 frames for each speech frame.
There exist several different alternatives to organizing these MFSC features into maps for the
CNN. First, as shown in Fig. 1(b), they can be arranged as three 2-D feature maps, each of which
represents MFSC features (static, delta and delta-delta) distributed along both frequency (using
the frequency band index) and time (using the frame number within each context window). In
this case, a two-dimensional convolution is performed (explained below) to normalize both
frequency and temporal variations simultaneously. Alternatively, we may only consider
normalizing frequency variations. In this case, the same MFSC features are organized as a
number of one-dimensional (1-D) feature maps (along the frequency band index), as shown in
Fig. 1(c). For example, if the context window contains 15 frames and 40 filter banks are used for
each frame, we will construct 45 (i.e., 15 times 3) 1-D feature maps, with each map having 40
dimensions, as shown in Fig. 1(c). As a result, a one-dimensional convolution will be applied
along the frequency axis. In this paper, we will only focus on this latter arrangement found in
Fig. 1(c), a one-dimensional convolution along frequency. Once the input feature maps are
formed, the convolution and pooling layers apply their respective operations to generate the
activations of the units in those layers, in sequence, as shown in Fig. 2. Similar to those of the
input layer, the units of the convolution and pooling layers can also be organized into maps. In
CNN terminology, a pair of convolution and pooling layers in Fig. 2 in succession is usually
referred to as one CNN “layer.” A deep CNN thus consists of two or more of these pairs in
succession. To avoid confusion, we will refer to convolution and pooling layers as convolution
and pooling plies, respectively.
B. Convolution Ply:
As shown in Fig. 2, every input feature map Oi (assume I is the total number) is connected to many feature maps
Qj (assume J is the total number) in the convolution ply, based on a number of local weight matrices wi,j
(I × J in total). The mapping can be represented as the well-known convolution operation in signal processing.
Assuming the input feature maps are all one-dimensional, each unit of one feature map in the convolution ply can
be computed as:
Fig. 2. An illustration of one CNN “layer” consisting of a pair of a convolution ply and a
pooling ply in succession.
where oi,m is the m-th unit of the i-th input feature map Oi, qj,m is the m-th unit of the j-th feature map Qj in
the convolution ply, and wi,j,n is the n-th element of the weight vector wi,j, which connects the i-th input
feature map to the j-th feature map of the convolution ply. F is called the filter size, which determines the
number of frequency bands in each input feature map that each unit in the convolution ply receives as input.
Because of the locality that arises from our choice of MFSC features, these feature maps are confined to a
limited frequency range of the speech signal.
A convolution ply differs from a standard, fully connected hidden layer in two important aspects,
however. First, each convolutional unit receives input only from a local area of the input. This
means that each unit represents some features of a local region of the input. Second, the units of
the convolution ply can themselves be organized into a number of feature maps, where all units
in the same feature map share the same weights but receive input from different locations of the
lower layer.
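The following sketch implements the 1-D convolution ply just described, with I input maps over B frequency bands,
J output maps, filter size F and weights shared across positions. The tanh activation and additive bias are
assumptions for illustration; the section does not fix a particular nonlinearity.

import numpy as np

def convolution_ply(O, W, b, activation=np.tanh):
    # O: (I, B) input feature maps; W: (I, J, F) local weight matrices; b: (J,) biases.
    # Returns Q of shape (J, B - F + 1), one band-limited activation per position m.
    I, B = O.shape
    _, J, F = W.shape
    Q = np.zeros((J, B - F + 1))
    for j in range(J):
        for m in range(B - F + 1):
            # q_{j,m} = activation( sum_i sum_n o_{i,m+n} * w_{i,j,n} + b_j )
            Q[j, m] = np.sum(O[:, m:m + F] * W[:, j, :]) + b[j]
    return activation(Q)

# Example: 45 one-dimensional input maps of 40 bands, filter size 5, 80 output maps,
# matching the numbers quoted for Fig. 4.
O = np.random.randn(45, 40)
W = np.random.randn(45, 80, 5) * 0.01
b = np.zeros(80)
print(convolution_ply(O, W, b).shape)   # (80, 36)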
C. Pooling Ply:
As shown in Fig. 2, a pooling operation is applied to the convolution ply to generate its
corresponding pooling ply. The pooling ply is also organized into feature maps, and it has the
same number of feature maps as the number of feature maps in its convolution ply, but each map
is smaller. The purpose of the pooling ply is to reduce the resolution of feature maps. This means
that the units of this ply will serve as generalizations over the features of the lower convolution
ply, and, because these generalizations will again be spatially localized in frequency, they will
also be invariant to small variations in location. This reduction is achieved by applying a pooling
function to several units in a local region of a size determined by a parameter called pooling size.
It is usually a simple function such as maximization or averaging. The pooling function is
applied to each convolution feature map independently. When the max-pooling function is used,
the pooling ply is defined as:
where G is the pooling size , and s, the shift size, determines the overlap of adjacent pooling
windows.
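A corresponding sketch of max-pooling, applied to each convolution feature map independently; the choice
G = s = 2 (non-overlapping windows) is an assumption for illustration.

import numpy as np

def max_pooling_ply(Q, pooling_size=2, shift=2):
    # Q: (J, M) convolution-ply activations; G = pooling_size, s = shift.
    J, M = Q.shape
    num_windows = 1 + (M - pooling_size) // shift
    P = np.zeros((J, num_windows))
    for k in range(num_windows):
        P[:, k] = Q[:, k * shift : k * shift + pooling_size].max(axis=1)
    return P

print(max_pooling_ply(np.random.randn(80, 36)).shape)   # (80, 18)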
where Oi represents the i-th input feature map and Wi,j represents each local weight matrix (flipped to adhere to
the definition of the convolution operation). This allows the convolution ply to be written in the same
mathematical form as the fully connected ANN layer, so that the same learning algorithm can be similarly applied.
When one-dimensional feature maps are used, the convolution operations can be represented as a simple matrix
multiplication by introducing a large sparse weight matrix, shown in Fig. 4, which is formed by replicating a
basic weight matrix W. The basic matrix W is constructed from all of the local weight matrices Wi,j as follows:
where W is organized into I × F rows (F again denotes the filter size, and each band contains I rows, one for
each of the I input feature maps), and W has J columns representing the weights of the J feature maps in the
convolution ply.
Fig. 4. All convolution operations in each convolution ply can be equivalently represented
as one large matrix multiplication involving a sparse weight matrix, where both local
connectivity and weight sharing can be represented in the structure of this sparse weight
matrix. This figure assumes a filter size of 5, 45 input feature maps and 80 feature maps in
the convolution ply.
The proposed voice recognition system has been divided into two modules.
Fig. 2. Speech Signal
Silence has been removed from the signal with the help of the zero-crossing rate and the energy vector. Two
energy thresholds, i.e. a lower and an upper threshold, are calculated. If the energy level of the signal is
above the maximum threshold or below the minimum threshold, that part of the signal is considered noise or
silence and is hence removed. The required signal obtained is known as the utterance, as shown below in Fig. 3.
Fig. 3. Utterance
The utterance is divided into a number of frames and then passed through a discrete filter. Fig. 4 shows a frame
and the output obtained after passing it through the discrete filter. Now this filtered signal is passed through
the Hamming window, and then, to convert this time-domain signal into the frequency domain, its 400-point FFT is
computed, as shown in Fig. 5.
Further, this signal is passed through a Mel filter bank having 24 filters; the length of the FFT is 512 and the
sampling frequency used is 16000 Hz. A sparse matrix containing the filter-bank amplitudes is then calculated,
and with its help the spectrum shown in Fig. 6 is obtained, in which the highest and lowest filters taper down to
zero.
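The whole chain up to the cepstral coefficients can be reproduced, for example, with librosa; this is a sketch
under the parameters quoted above (16 kHz sampling, 512-point FFT, 24 mel filters), and the file name
utterance.wav is a hypothetical placeholder.

import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr,
    n_mfcc=13,        # cepstral coefficients kept per frame
    n_fft=512,        # FFT length, as stated above
    hop_length=100,   # frame shift M from the framing step
    n_mels=24,        # number of mel filter-bank channels
)
print(mfcc.shape)     # (13, number_of_frames)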
We now have an array of numbers, as shown in the figure, with each number representing the amplitude of the sound
wave at fixed sampling intervals.
The output obtained from the MFCC stage is given to the CNN so that it can analyze the data, calculate the
weights and train on the data to predict the text to be produced.
Convolutional neural networks can do a lot of useful things if they are fed with a bunch of signals, for instance
learning basic signal properties such as frequency changes and amplitude changes. Since they are multi-layer
neural networks, the first layer is fed with this raw information and the second layer is fed with more
recognizable features.
The convolutional neural network matches parts of the signal instead of considering the whole signal at once, as
it becomes difficult for a computer to identify the signal when the whole set of samples is considered [8][9].
The mathematics behind this matching is filtering. It is done by lining a feature up with a patch of the signal,
comparing and multiplying the values one by one, adding them up and dividing by the total number of values. This
step is repeated for every patch considered. The act of convolving signals with a bunch of filters (a bunch of
features), which creates a stack of filtered signals, is called a convolutional layer. It is a layer because it
operates on a stack: in convolution, one signal becomes a stack of filtered signals, and we get many filtered
signals because of the presence of the multiple filters. The convolution layer is one part.
The next big part is called pooling, which is how a signal stack can be compressed. This is done by considering a
small window, which might be 2 by 2 or 3 by 3. A 2 by 2 window is passed in strides across the filtered signals,
and from each window the maximum value is taken.
The standard feed-forward, fully connected neural network (NN) is a computational model composed of several
layers. The input to a particular unit is the set of outputs of all the units in the previous layer (or the input
data, for the first layer). The unit output is a single linear combination of its inputs, to which a specific
activation function is applied. A convolutional neural network (CNN) is a type of NN where the input variables
are spatially related to each other.
CNNs were developed to take these important spatial relations into account. Not only are they able to detect
general spatial dependencies, they are also capable of recognizing specific patterns. Shared weights,
representing different patterns, improve convergence by significantly reducing the number of parameters. A CNN
recognizes small patterns at each layer and generalizes them (detecting higher-order, more complex patterns) in
subsequent layers. This allows detection of various patterns while keeping the number of weights to be learnt
very low.
Here we use only 4 CNN layers; more can be added to improve accuracy, depending on the system being worked on.
To minimize the error we use the "Adam" optimizer, and for measurement we use accuracy as the metric.
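A minimal Keras sketch of such a model is shown below. The input shape (13 coefficients × 100 frames), the number
of filters per layer and NUM_CLASSES are assumptions for illustration; only the four convolutional layers, the
Adam optimizer and the accuracy metric come from the description above.

from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10                     # hypothetical number of target words
model = keras.Sequential([
    layers.Input(shape=(13, 100, 1)),                            # MFCCs x frames x 1 channel
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# "Adam" optimizer to minimize the error, accuracy as the metric.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])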
The primary purpose of a Recognizer instance is, of course, to recognize speech. Each
instance comes with a variety of settings and functionality for recognizing speech from
an audio source.
Now, instead of using an audio file as the source, you will use the default system
microphone. You can access this by creating an instance of the Microphone class.
If your system has no default microphone (such as on a RaspberryPi), or you want to use
a microphone other than the default, you will need to specify which one to use by
supplying a device index. You can get a list of microphone names by calling
the list_microphone_names() static method of the Microphone class.
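The following sketch shows how these pieces of the speech_recognition package fit together; recognize_google() is
just one of the library's recognize_*() methods and is used here for illustration.

import speech_recognition as sr

# List the available microphones so a device_index can be chosen if the
# default microphone is not suitable (e.g., on a Raspberry Pi).
print(sr.Microphone.list_microphone_names())

recognizer = sr.Recognizer()
with sr.Microphone() as source:              # or sr.Microphone(device_index=3)
    recognizer.adjust_for_ambient_noise(source)
    print("Say something...")
    audio = recognizer.listen(source)

try:
    print(recognizer.recognize_google(audio))   # sends the audio to Google's free web API
except sr.UnknownValueError:
    print("Could not understand the audio")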
4. RESULTS
After running the code, if you speak something, the system predicts the text as shown in the figure.
Fig: Results
5. CONCLUSION
Voice recognition is the computer analysis of the human voice, particularly for the purpose of translating words
and phrases and automatically identifying who is speaking on the basis of the individual information incorporated
in speech waves.
We discussed two modules used in the voice recognition system which are important in improving its performance.
The first module provides information on how to extract MFCC coefficients from the voice signal, and the second
module provides the algorithm for comparing or matching them with the already-stored users' voice features using
a Convolutional Neural Network.
The performance of the neural network is impacted largely by the pre-processing technique. On the other hand, it
is observed that Mel Frequency Cepstrum Coefficients are a very reliable tool for the pre-processing stage, and
very good results are provided by these coefficients.
REFERENCES
[1] Jingdong Chen, Member, Yiteng (Arden) Huang, Qi Li, Kuldip K. Paliwal, “Recognition of Noisy Speech using
Dynamic Spectral Subband Centroids” IEEE Signal Processing Letters, Vol. 11, No. 2, February 2004.
[2] Hakan Erdogan, Ruhi Sarikaya, Yuqing Gao, “Using semantic analysis to improve speech recognition
performance” Computer Speech and Language, ELSEVIER 2005.
[3] Chadawan Ittichaichareon, Patiyuth Pramkeaw, “Improving MFCC-based Speech Classification with FIR Filter”
International Conference on Computer Graphics, Simulation and Modelling (ICGSM'2012), July 28-29, 2012,
Pattaya (Thailand).
[4] Bhupinder Singh, Neha Kapur, Puneet Kaur, “Speech Recognition with Hidden Markov Model: A Review”
International Journal of Advanced Research in Computer and Software Engineering, Vol. 2, Issue 3, March 2012.
[5] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, “Isolated Speech Recognition using MFCC and
DTW” International Journal of Advance Research in Electrical, Electronics and Instrumentation Engineering,
Vol.2, Issue 8, August 2013.
[6] Ibrahim Patel, Dr. Y. Srinivas Rao, “Speech Recognition using HMM with MFCC-an analysis using
Frequency Spectral Decomposition Technique” Signal and Image Processing:An International Journal(SIPIJ),
Vol.1, Number.2, December 2010.
[7] Om Prakash Prabhakar, Navneet Kumar Sahu, “A Survey on Voice Command Recognition Technique” International
Journal of Advanced Research in Computer and Software Engineering, Vol. 3, Issue 5, May 2013.
[8] M A Anusuya, “Speech recognition by Machine”, International Journal of Computer Science and
Information security, Vol. 6, number 3,2009.
[9] Sikha Gupta, Jafreezal Jaafar, Wan Fatimah wan Ahmad, Arpit Bansal, “Feature Extraction Using MFCC”
Signal & Image Processing:An International Journal, Vol 4, No. 4, August 2013.
[10] Kavita Sharma, Prateek Hakar “Speech Denoising Using Different Types of Filters” International journal of
Engineering Research and Applications Vol. 2, Issue 1, Jan-Feb 2012
[11] Elhan S. Salam, Reda A. El-Khoribi, Mahmoud E. Shoman, “Audio Visual Speech Recognition For People with Speech
Disorder” volume.96, No.2, June 2014
[12] R. B. Shinde, Dr. V. P. Pawar, “Vowel Classification Based on LPC and ANN” IJCA, volume.50, No.6, July 2012.