Speech Recognition Using Neural Networks
1. INTRODUCTION
Speech can be a useful interface for interacting with machines. Research aimed at improving this type of
communication has been carried out for a long time, and with the growth of computational power it has become
possible to build systems capable of real-time conversion. Despite the good progress made in this field, speech
recognition still faces many problems. These problems are due to variations among speakers, including variations
caused by age, sex, speaking rate and the emotional condition of the speaker, all of which lead to differences in
the pronunciation of different persons. The surroundings can add noise to the signal, and sometimes the speaker
introduces noise as well. In the speech recognition process, an acoustic signal captured by a microphone or
telephone is converted into a set of characters. Automatic speech recognition (ASR) is viewed here as an integral
part of future human-computer interfaces; hence, humans could use speech as a useful interface for interacting
with machines. Humans have always wanted computing to be natural, pervasive and simultaneous. Elham S. Salam
compared the effect of visual features on the performance of a speech recognition system for people with speech
disorders against an audio-only speech recognition system; different methods of visual feature selection were
compared and isolated English words were recognized. The recognition of a simple alphabet may seem a simple task
for human beings, but because of problems such as high acoustic similarity among certain groups of letters, speech
recognition can be a challenging task. The use of conventional Multi-Layer Perceptron neural networks is
increasing day by day, and such networks have performed well as effective classifiers for vowel sounds with
stationary spectra. Feed-forward multi-layer neural networks, however, are not able to deal with time-varying
information such as the time-varying spectra of speech sounds. This problem can be coped with by incorporating a
feedback structure into the network.
Speech recognition systems can be classified into several different types according to the type of speech
utterance, the type of speaker model and the type of vocabulary that they have the ability to recognize. These
categories are briefly explained below:
A. Types of speech utterance:
Speech recognition systems are classified according to the type of utterance they are able to recognize. They are
classified as:
1) Isolated word: An isolated-word recognizer usually requires each spoken word to have quiet (lack of an audio
signal) on both sides of the sample window. It accepts a single word at a time.
2) Connected word: Similar to isolated-word recognition, but it allows separate utterances to be run together with
only a minimum pause between them.
3) Continuous speech: It allows users to speak naturally while the computer determines the content in parallel.
4) Spontaneous Speech: It is the type of speech which is natural sounding and is not rehearsed.
B. Types of speaker model:
Speech recognition systems fall broadly into two main categories based on speaker models, namely speaker
dependent and speaker independent.
1) Speaker dependent models: These systems are designed for a specific speaker. They are easier to develop and
more accurate, but they are not very flexible.
2) Speaker independent models: These systems are designed for a variety of speakers. They are more difficult to
develop and less accurate, but they are much more flexible.
C. Types of vocabulary:
The vocabulary size of a speech recognition system affects its processing requirements, accuracy and complexity.
In a speech-to-text voice recognition system, the types of vocabularies can be classified as follows:
1) Small vocabulary: single letters.
2) Medium vocabulary: two- or three-letter words.
3) Large vocabulary: words with more letters.
1.1 Problem Definition:
The main objective of this project is to recognize speech and convert it into text using a neural network.
Further aims are:
To reduce the work of human beings.
To save time during any work in an emergency.
To be useful for deaf and dumb people.
To create a tool that helps people learn to speak English correctly in an effective way.
To engage people with the new technology.
1.2 Project overview
The architectural diagram of a typical voice and speaker recognition system is shown in
Figure 1. The system is trained to recognize the voice of individual speakers with each speaker
providing specific sets of utterances through a microphone terminal or telephone. The captured
analog voice waveform has three components: speech segment, silence or non-voiced segment,
and background noise signals. To extract the relevant speech signals, the voice waveform is
digitized and signal processing is carried out to remove the noise signals and the silence or non-
voiced components. Any relevant information that is discarded during this processing is lost; conversely, any
irrelevant information that is allowed to pass, such as the fundamental frequency of the speaker or the
characteristics of the microphone, is treated as useful, with implications for the speech-feature classification
performance. The extracted speech signals are then converted into streams of template feature vectors of the
voice pattern for classification and training. If irrelevant information is allowed through, the speech features
generated from the corrupted speech signals may no longer be similar to the class distributions learned from the
training data. The system recognizes the voice of individual speakers by comparing the extracted speech features
of their utterances with the respective template features obtained from the training stage. The GMM recognizer
computes scores that are used for matching the most distinctive speech features of speakers. The decision
criteria for the voice recognition of speakers are based on correlation analysis of the speech features obtained
from the CNN and MFCC.
The speaker whose voice characteristics best match the stored voice is identified, and a speaker whose voice
characteristics are not matched is eligible for a new entry in the database (Fig).
2. LITERATURE SURVEY
2.1 Existing Systems:
Various types of voice and speaker recognition techniques are available. In this section, we
provide the literature review of work done in this field.
Esfandier Zavarehei et al. (2005) introduced a time-frequency estimator for the enhancement of noisy speech
signals in the DFT domain. It is based on a low-order autoregressive process used to model the time-varying
trajectory of the DFT components of speech, which is embedded in the Kalman filter state equation. A method was
devised to restart the Kalman filter at the onsets of speech. The performance of this method was compared with
parametric spectral subtraction and the MMSE estimator for the enhancement of noisy speech. The result of the
proposed method is that residual noise is reduced and the quality of speech is improved using Kalman filters [2].
Puneet Kaur, Bhupender Singh and Neha Kapur (2012) discussed how to use the Hidden Markov Model in the process of
speech recognition. To develop an ASR (Automatic Speech Recognition) system, the three essential steps are
pre-processing, feature extraction and recognition, and finally the hidden Markov model is used to obtain the
desired result. Researchers are continuously trying to develop a perfect ASR system; although there have already
been huge advancements in the field of digital signal processing, the performance of computers in this field is
still not high enough in terms of speed of response and matching accuracy. The three different techniques used by
researchers are the acoustic-phonetic approach, the pattern recognition approach and the knowledge-based
approach [4].
Ibrahim Patel et al. (2010) presented an approach to speech recognition in which frequency spectral information
is combined with the conventional Mel spectrum to improve HMM-based recognition. The Mel-frequency approach
exploits frequency observations of speech within a given resolution, which results in overlapping resolution
features and thereby limits recognition. In the HMM-based speech recognition system, resolution decomposition
with a separating-frequency mapping approach is used. The result of the study is an improvement in the quality
metrics of speech recognition with respect to computational time and learning accuracy in the speech recognition
system [6].
Hidden Markov Model:
Consider a system which may be described at any time as being in one of a set of N distinct states,
S1, S2, S3, ..., SN. At regular time intervals the system undergoes a change of state (possibly back to the same
state) according to a set of probabilities associated with the state. We denote the time instants associated with
state changes as t = 1, 2, ..., and we denote the actual state at time t as qt. A full probabilistic description
of the above system would, in general, require specification of the current state as well as all the predecessor
states. For the special case of a discrete, first-order Markov chain, this probabilistic description is truncated
to just the current and the predecessor state, i.e.
P[qt = Sj | qt−1 = Si, qt−2 = Sk, ...] = P[qt = Sj | qt−1 = Si]
Furthermore, we consider only those processes in which the right-hand side of the above equation is independent
of time, thereby leading to the set of state transition probabilities aij of the form
aij = P[qt = Sj | qt−1 = Si],  1 ≤ i, j ≤ N
with the state transition coefficients having the properties aij ≥ 0 and Σ (j = 1 to N) aij = 1.
The above process could be called an observable Markov model, since the output of the process is the set of
states at each instant of time, where each state corresponds to an observable event.
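As a minimal illustration of these two properties, the following Python sketch builds a small transition matrix
with made-up probabilities (they are assumptions, not values from this project), checks that every aij is
non-negative and every row sums to one, and samples a state sequence from the resulting observable Markov model.

import numpy as np

# Hypothetical 3-state transition matrix A, where A[i, j] = P(q_t = Sj | q_t-1 = Si).
# The probabilities are illustrative only; a real model would estimate them from data.
A = np.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
])

# The two properties from the text: aij >= 0 and each row sums to 1.
assert np.all(A >= 0) and np.allclose(A.sum(axis=1), 1.0)

# Sample a state sequence q1, q2, ... from this observable Markov model.
rng = np.random.default_rng(0)
state = 0                              # start in state S1
sequence = [state]
for t in range(10):
    state = rng.choice(3, p=A[state])  # next state drawn from row `state` of A
    sequence.append(state)
print("state sequence:", sequence)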
In spectral analysis, the waveform of the speech signal is analyzed through its spectral representation. Apart
from this, another tool that needs to be studied is the Mel Frequency Cepstrum Coefficients (MFCC). The Mel
frequency cepstrum coefficients are preferred for extracting the features of the speech signal: they transform
the speech signal into the frequency domain, and hence the training vectors are generated from them. Another
reason for using this method is that human hearing is based on frequency analysis. Before obtaining the MFCC of a
speech signal, pre-emphasis filtering is applied to the signal with a first-order finite impulse response (FIR)
filter.
This figure shows the general procedure of the speech recognition process.
Testing is the process in which different speech signals are tested using a special type of neural network. This
is the main step in the speech recognition process. Testing of the speech signals is done after training.
Pre-emphasis refers to a system process designed to increase, within a band of frequencies, the magnitude of some
(usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies in order to
improve the overall SNR. Hence, this step passes the signal through a filter which emphasizes higher frequencies,
increasing the energy of the signal at higher frequencies.
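As a minimal sketch of this step, the first-order FIR pre-emphasis filter y[n] = x[n] − a·x[n−1] can be applied
as follows; the coefficient a = 0.97 is a common choice and an assumption here, since the report does not state
the exact value used.

import numpy as np

def pre_emphasis(signal, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before feature extraction.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])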
Framing:
The speech samples obtained from an ADC are segmented into small frames with a length in the range of 20 to
40 ms. The voice signal is divided into frames of N samples, with adjacent frames separated by M samples
(M < N). Typical values used are M = 100 and N = 256.
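A simple way to perform this framing, assuming the signal is already a NumPy array and using the N = 256,
M = 100 values quoted above, is sketched below.

import numpy as np

def frame_signal(signal, frame_length=256, frame_step=100):
    # Split a 1-D signal into overlapping frames of N samples shifted by M samples.
    num_frames = 1 + max(0, (len(signal) - frame_length) // frame_step)
    return np.stack([
        signal[i * frame_step : i * frame_step + frame_length]
        for i in range(num_frames)
    ])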
Hamming windowing:
The Hamming window is used as the window shape, taking into account the next block in the feature-extraction
processing chain and integrating all the closest frequency lines. The Hamming window and its application to each
frame are represented as shown below.
If the window is defined as W (n), 0 ≤ n ≤ N-1 where
N = number of samples in each frame
Y[n] = Output signal
X (n) = input signal
W (n) = Hamming window, then the result of windowing signal is shown below:
Y[n] =X (n)*W (n)
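A brief sketch of this windowing step is shown below; the Hamming coefficients follow the textbook definition
W(n) = 0.54 − 0.46·cos(2πn/(N−1)), which is assumed here since the section does not spell the formula out (it is
also available directly as numpy.hamming).

import numpy as np

N = 256                          # samples per frame, as in the framing step
n = np.arange(N)
W = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))   # Hamming window, same as np.hamming(N)

def apply_window(frame, window=W):
    # Y[n] = X(n) * W(n): element-wise product of the frame and the window.
    return frame * window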
To convert each frame of N samples from the time domain into the frequency domain, the FFT is used. The Fourier
transform converts the convolution of the glottal pulse U[n] and the vocal-tract impulse response H[n] in the
time domain into a multiplication in the frequency domain, as shown in the equation below:
Y(w) = FFT[h(t) * X(t)] = H(w) · X(w)
where X(w), H(w) and Y(w) are the Fourier transforms of X(t), H(t) and Y(t), respectively.
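A minimal sketch of this step, assuming each windowed frame is a NumPy array and using an FFT length of 512 (the
length mentioned later in this report; the 400-point FFT of Fig. 5 would simply use n_fft=400):

import numpy as np

def frame_spectrum(windowed_frame, n_fft=512):
    # Magnitude spectrum of one windowed frame; rfft keeps the non-negative frequencies.
    return np.abs(np.fft.rfft(windowed_frame, n=n_fft))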
Energy = Σt X²[t]
where X[t] is the signal.
Each of the 13 delta features represents the change between frames in the corresponding cepstral or energy
feature, while each of the 13 double-delta features represents the change between frames in the corresponding
delta features, giving 39 features per frame in total.
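These delta and double-delta features can be computed, for example, with librosa (an assumption; any equivalent
differencing routine would do). The mfcc array below is a placeholder standing in for the coefficients extracted
earlier.

import numpy as np
import librosa

mfcc = np.random.randn(13, 100)                # placeholder: 13 coefficients x 100 frames
delta = librosa.feature.delta(mfcc)            # first-order (delta) features
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (double-delta) features
features = np.vstack([mfcc, delta, delta2])    # shape (39, 100): full feature vector per frame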
The static, delta and delta-delta features thus play roles analogous to the red, green and blue channels of an
image, although, as described below, there is more than one alternative for how precisely to bundle these into
feature maps.
In keeping with this metaphor, we need to use inputs that preserve locality in both axes of frequency and
time. Time presents no immediate problem from the standpoint of locality. Like other DNNs for speech, a
single window of input to the CNN will consist of a wide amount of context (9–15 frames). As for
frequency, the conventional use of MFCCs does present a major problem because the discrete cosine
transform projects the spectral energies into a new basis that may not maintain locality. In this paper, we
shall use the log-energy computed directly from the mel-frequency spectral coefficients (i.e., with no
DCT), which we will denote as MFSC features. These will be used to represent each speech frame, along
with their deltas and delta-deltas, in order to describe the acoustic energy distribution in each of several
different frequency bands.
Fig. 1. Two different ways can be used to organize speech input features to a CNN. The above example assumes 40
MFSC features plus first and second derivatives, with a context window of 15 frames for each speech frame.
There exist several different alternatives to organizing these MFSC features into maps for the
CNN. First, as shown in Fig. 1(b), they can be arranged as three 2-D feature maps, each of which
represents MFSC features (static, delta and delta-delta) distributed along both frequency (using
the frequency band index) and time (using the frame number within each context window). In
this case, a two-dimensional convolution is performed (explained below) to normalize both
frequency and temporal variations simultaneously. Alternatively, we may only consider
normalizing frequency variations. In this case, the same MFSC features are organized as a
number of one-dimensional (1-D) feature maps (along the frequency band index), as shown in
Fig. 1(c). For example, if the context window contains 15 frames and 40 filter banks are used for
each frame, we will construct 45 (i.e., 15 times 3) 1-D feature maps, with each map having 40
dimensions, as shown in Fig. 1(c). As a result, a one-dimensional convolution will be applied
along the frequency axis. In this paper, we will only focus on this latter arrangement found in
Fig. 1(c), a one-dimensional convolution along frequency. Once the input feature maps are
formed, the convolution and pooling layers apply their respective operations to generate the
activations of the units in those layers, in sequence, as shown in Fig. 2. Similar to those of the
input layer, the units of the convolution and pooling layers can also be organized into maps. In
CNN terminology, a pair of convolution and pooling layers in Fig. 2 in succession is usually
referred to as one CNN “layer.” A deep CNN thus consists of two or more of these pairs in
succession. To avoid confusion, we will refer to convolution and pooling layers as convolution
and pooling plies, respectively.
B. Convolution Ply:
As shown in Fig. 2, every input feature map Oi (assume I is the total number) is connected to many feature maps
Qj (assume J is the total number) in the convolution ply, based on a number of local weight matrices wi,j
(I × J in total). The mapping can be represented as the well-known convolution operation in signal processing.
Assuming the input feature maps are all one-dimensional, each unit of one feature map in the convolution ply can
be computed as:
Fig. 2. An illustration of one CNN “layer” consisting of a pair of a convolution ply and a
pooling ply in succession.
where oi,m is the m-th unit of the i-th input feature map Oi, qj,m is the m-th unit of the j-th feature map Qj in
the convolution ply, and wi,j,n is the n-th element of the weight vector wi,j, which connects the i-th input
feature map to the j-th feature map of the convolution ply. F is called the filter size, which determines the
number of frequency bands in each input feature map that each unit in the convolution ply receives as input.
Because of the locality that arises from our choice of MFSC features, these feature maps are confined to a
limited frequency range of the speech signal.
A convolution ply differs from a standard, fully connected hidden layer in two important aspects,
however. First, each convolutional unit receives input only from a local area of the input. This
means that each unit represents some features of a local region of the input. Second, the units of
the convolution ply can themselves be organized into a number of feature maps, where all units
in the same feature map share the same weights but receive input from different locations of the
lower layer.
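The following sketch implements the 1-D convolution ply just described, with I input maps over B frequency bands,
J output maps, filter size F and weights shared across positions. The tanh activation and additive bias are
assumptions for illustration; the section does not fix a particular nonlinearity.

import numpy as np

def convolution_ply(O, W, b, activation=np.tanh):
    # O: (I, B) input feature maps; W: (I, J, F) local weight matrices; b: (J,) biases.
    # Returns Q of shape (J, B - F + 1), one band-limited activation per position m.
    I, B = O.shape
    _, J, F = W.shape
    Q = np.zeros((J, B - F + 1))
    for j in range(J):
        for m in range(B - F + 1):
            # q_{j,m} = activation( sum_i sum_n o_{i,m+n} * w_{i,j,n} + b_j )
            Q[j, m] = np.sum(O[:, m:m + F] * W[:, j, :]) + b[j]
    return activation(Q)

# Example: 45 one-dimensional input maps of 40 bands, filter size 5, 80 output maps,
# matching the numbers quoted for Fig. 4.
O = np.random.randn(45, 40)
W = np.random.randn(45, 80, 5) * 0.01
b = np.zeros(80)
print(convolution_ply(O, W, b).shape)   # (80, 36)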
C. Pooling Ply:
As shown in Fig. 2, a pooling operation is applied to the convolution ply to generate its
corresponding pooling ply. The pooling ply is also organized into feature maps, and it has the
same number of feature maps as the number of feature maps in its convolution ply, but each map
is smaller. The purpose of the pooling ply is to reduce the resolution of feature maps. This means
that the units of this ply will serve as generalizations over the features of the lower convolution
ply, and, because these generalizations will again be spatially localized in frequency, they will
also be invariant to small variations in location. This reduction is achieved by applying a pooling
function to several units in a local region of a size determined by a parameter called pooling size.
It is usually a simple function such as maximization or averaging. The pooling function is
applied to each convolution feature map independently. When the max-pooling function is used,
the pooling ply is defined as:
where G is the pooling size , and s, the shift size, determines the overlap of adjacent pooling
windows.
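A corresponding sketch of max-pooling, applied to each convolution feature map independently; the choice
G = s = 2 (non-overlapping windows) is an assumption for illustration.

import numpy as np

def max_pooling_ply(Q, pooling_size=2, shift=2):
    # Q: (J, M) convolution-ply activations; G = pooling_size, s = shift.
    J, M = Q.shape
    num_windows = 1 + (M - pooling_size) // shift
    P = np.zeros((J, num_windows))
    for k in range(num_windows):
        P[:, k] = Q[:, k * shift : k * shift + pooling_size].max(axis=1)
    return P

print(max_pooling_ply(np.random.randn(80, 36)).shape)   # (80, 18)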
where Oi represents the i-th input feature map and Wi,j represents each local weight matrix (flipped to adhere to
the definition of the convolution operation). This allows the convolution ply to be written in the same
mathematical form as the fully connected ANN layer, so that the same learning algorithm can be similarly applied.
When one-dimensional feature maps are used, the convolution operations can be represented as a simple matrix
multiplication by introducing a large sparse weight matrix, shown in Fig. 4, which is formed by replicating a
basic weight matrix W. The basic matrix W is constructed from all of the local weight matrices Wi,j as follows:
where W is organized into I × F rows (F again denotes the filter size, and each band contains I rows, one for
each of the I input feature maps), and W has J columns representing the weights of the J feature maps in the
convolution ply.
Fig. 4. All convolution operations in each convolution ply can be equivalently represented
as one large matrix multiplication involving a sparse weight matrix, where both local
connectivity and weight sharing can be represented in the structure of this sparse weight
matrix. This figure assumes a filter size of 5, 45 input feature maps and 80 feature maps in
the convolution ply.
The proposed voice recognition system has been divided into two modules.
Fig. 2. Speech Signal
Silence has been removed from the signal with the help of the zero-crossing rate and the energy vector. Two
energy thresholds, i.e. a lower and an upper threshold, are calculated. If the energy level of the signal is
above the maximum threshold or below the minimum threshold, that part of the signal is considered noise or
silence and is hence removed. The required signal obtained is known as the utterance, as shown below in Fig. 3.
Fig. 3. Utterance
The utterance is divided into a number of frames and then passed through a discrete filter. Fig. 4 shows a frame
and the output obtained after passing it through the discrete filter. Now this filtered signal is passed through
the Hamming window, and then, to convert this time-domain signal into the frequency domain, its 400-point FFT is
computed, as shown in Fig. 5.
Further, this signal is passed through a Mel filter bank having 24 filters; the length of the FFT is 512 and the
sampling frequency used is 16000 Hz. A sparse matrix containing the filter-bank amplitudes is then calculated,
and with its help the spectrum shown in Fig. 6 is obtained, in which the highest and lowest filters taper down to
zero.
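The whole chain up to the cepstral coefficients can be reproduced, for example, with librosa; this is a sketch
under the parameters quoted above (16 kHz sampling, 512-point FFT, 24 mel filters), and the file name
utterance.wav is a hypothetical placeholder.

import librosa

signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical recording
mfcc = librosa.feature.mfcc(
    y=signal, sr=sr,
    n_mfcc=13,        # cepstral coefficients kept per frame
    n_fft=512,        # FFT length, as stated above
    hop_length=100,   # frame shift M from the framing step
    n_mels=24,        # number of mel filter-bank channels
)
print(mfcc.shape)     # (13, number_of_frames)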
We now have an array of numbers, as shown in the figure, with each number representing the amplitude of the sound
wave at fixed sampling intervals.
The output obtained from the MFCC stage is given to the CNN so that it can analyze the data, calculate the
weights and train on the data to predict the text to be produced.
Convolutional neural networks can do a lot of useful things if they are fed with a bunch of signals, for instance
learning basic signal properties such as frequency changes and amplitude changes. Since they are multi-layer
neural networks, the first layer is fed with this raw information and the second layer is fed with more
recognizable features.
The convolutional neural network matches parts of the signal instead of considering the whole signal at once, as
it becomes difficult for a computer to identify the signal when the whole set of samples is considered [8][9].
The mathematics behind this matching is filtering. It is done by lining a feature up with a patch of the signal,
comparing and multiplying the values one by one, adding them up and dividing by the total number of values. This
step is repeated for every patch considered. The act of convolving signals with a bunch of filters (a bunch of
features), which creates a stack of filtered signals, is called a convolutional layer. It is a layer because it
operates on a stack: in convolution, one signal becomes a stack of filtered signals, and we get many filtered
signals because of the presence of the multiple filters. The convolution layer is one part.
The next big part is called pooling, which is how a signal stack can be compressed. This is done by considering a
small window, which might be 2 by 2 or 3 by 3. A 2 by 2 window is passed in strides across the filtered signals,
and from each window the maximum value is taken.
The standard feed-forward, fully connected neural network (NN) is a computational model composed of several
layers. The input to a particular unit is the set of outputs of all the units in the previous layer (or the input
data, for the first layer). The unit output is a single linear combination of its inputs, to which a specific
activation function is applied. A convolutional neural network (CNN) is a type of NN where the input variables
are spatially related to each other.
CNNs were developed to take these important spatial relations into account. Not only are they able to detect
general spatial dependencies, they are also capable of recognizing specific patterns. Shared weights,
representing different patterns, improve convergence by significantly reducing the number of parameters. A CNN
recognizes small patterns at each layer and generalizes them (detecting higher-order, more complex patterns) in
subsequent layers. This allows detection of various patterns while keeping the number of weights to be learnt
very low.
Here we use only 4 CNN layers; more can be added to improve accuracy, depending on the system being worked on.
To minimize the error we use the "Adam" optimizer, and for measurement we use accuracy as the metric.
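A minimal Keras sketch of such a model is shown below. The input shape (13 coefficients × 100 frames), the number
of filters per layer and NUM_CLASSES are assumptions for illustration; only the four convolutional layers, the
Adam optimizer and the accuracy metric come from the description above.

from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10                     # hypothetical number of target words
model = keras.Sequential([
    layers.Input(shape=(13, 100, 1)),                            # MFCCs x frames x 1 channel
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

# "Adam" optimizer to minimize the error, accuracy as the metric.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])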
The primary purpose of a Recognizer instance is, of course, to recognize speech. Each
instance comes with a variety of settings and functionality for recognizing speech from
an audio source.
Now, instead of using an audio file as the source, you will use the default system
microphone. You can access this by creating an instance of the Microphone class.
If your system has no default microphone (such as on a RaspberryPi), or you want to use
a microphone other than the default, you will need to specify which one to use by
supplying a device index. You can get a list of microphone names by calling
the list_microphone_names() static method of the Microphone class.
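The following sketch shows how these pieces of the speech_recognition package fit together; recognize_google() is
just one of the library's recognize_*() methods and is used here for illustration.

import speech_recognition as sr

# List the available microphones so a device_index can be chosen if the
# default microphone is not suitable (e.g., on a Raspberry Pi).
print(sr.Microphone.list_microphone_names())

recognizer = sr.Recognizer()
with sr.Microphone() as source:              # or sr.Microphone(device_index=3)
    recognizer.adjust_for_ambient_noise(source)
    print("Say something...")
    audio = recognizer.listen(source)

try:
    print(recognizer.recognize_google(audio))   # sends the audio to Google's free web API
except sr.UnknownValueError:
    print("Could not understand the audio")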
4. RESULTS
After running the code, if you speak something, the system predicts the text as shown in the figure.
Fig: Results
5. CONCLUSION
Voice recognition is the computer analysis of the human voice, particularly for the purpose of translating words
and phrases and automatically identifying who is speaking on the basis of the individual information incorporated
in speech waves.
We discussed two modules used in the voice recognition system which are important in improving its performance.
The first module provides information on how to extract MFCC coefficients from the voice signal, and the second
module provides the algorithm for comparing or matching them with the already-stored users' voice features using
a Convolutional Neural Network.
The performance of the neural network is impacted largely by the pre-processing technique. On the other hand, it
is observed that Mel Frequency Cepstrum Coefficients are a very reliable tool for the pre-processing stage, and
very good results are provided by these coefficients.
REFERENCES
[1] Jingdong Chen, Member, Yiteng (Arden) Huang, Qi Li, Kuldip K. Paliwal, “Recognition of Noisy Speech using
Dynamic Spectral Subband Centroids” IEEE Signal Processing Letters, Vol. 11, No. 2, February 2004.
[2] Hakan Erdogan, Ruhi Sarikaya, Yuqing Gao, “Using semantic analysis to improve speech recognition
performance” Computer Speech and Language, ELSEVIER 2005.
[3] Chadawan Ittichaichareon, Patiyuth Pramkeaw, “Improving MFCC-based Speech Classification with FIR Filter”
International Conference on Computer Graphics, Simulation and Modelling (ICGSM'2012), July 28-29, 2012,
Pattaya (Thailand).
[4] Bhupinder Singh, Neha Kapur, Puneet Kaur, “Speech Recognition with Hidden Markov Model: A Review”
International Journal of Advanced Research in Computer and Software Engineering, Vol. 2, Issue 3, March 2012.
[5] Shivanker Dev Dhingra, Geeta Nijhawan, Poonam Pandit, “Isolated Speech Recognition using MFCC and
DTW” International Journal of Advance Research in Electrical, Electronics and Instrumentation Engineering,
Vol.2, Issue 8, August 2013.
[6] Ibrahim Patel, Dr. Y. Srinivas Rao, “Speech Recognition using HMM with MFCC-an analysis using
Frequency Spectral Decomposition Technique” Signal and Image Processing:An International Journal(SIPIJ),
Vol.1, Number.2, December 2010.
[7] Om Prakash Prabhakar, Navneet Kumar Sahu, “A Survey on Voice Command Recognition Technique” International
Journal of Advanced Research in Computer and Software Engineering, Vol. 3, Issue 5, May 2013.
[8] M A Anusuya, “Speech recognition by Machine”, International Journal of Computer Science and
Information security, Vol. 6, number 3,2009.
[9] Sikha Gupta, Jafreezal Jaafar, Wan Fatimah wan Ahmad, Arpit Bansal, “Feature Extraction Using MFCC”
Signal & Image Processing:An International Journal, Vol 4, No. 4, August 2013.
[10] Kavita Sharma, Prateek Hakar “Speech Denoising Using Different Types of Filters” International journal of
Engineering Research and Applications Vol. 2, Issue 1, Jan-Feb 2012
[11] Elhan S. Salam, Reda A. El-Khoribi, Mahmoud E. Shoman, “Audio Visual Speech Recognition For People with Speech
Disorder” volume.96, No.2, June 2014
[12] R. B. Shinde, Dr. V. P. Pawar, “Vowel Classification Based on LPC and ANN” IJCA, volume.50, No.6, July 2012.