
International Journal of Computer Applications (0975 – 8887)

National Conference on Latest Initiatives& Innovations in Communication and Electronics (IICE 2016)

Designing a Real Time Speech Recognition System using MATLAB

Neha Sharma
Student, ME-EC
Department of Electronics & Communications,
Chandigarh University, Mohali, Chandigarh, India

Shipra Sardana
Assistant Professor
Department of Electronics & Communications,
Chandigarh University, Mohali, Chandigarh, India

ABSTRACT
A real-time speech-to-text conversion system converts uttered words into text immediately after the utterance. This paper introduces a way for humans to interact with a computer through natural language processing, in the form of a speech recognition system. Nine voice samples were recorded through a microphone and the system was trained on the recorded samples. MFCC features of the speech samples were calculated, and words were distinguished according to the energies associated with each sampled word. The system provides high accuracy in text conversion.

Keywords
Speech Recognition System

1. INTRODUCTION
Speech recognition is an important application of Natural Language Processing (NLP). Speech is the most important part of communication; we express our ideas through a specific language, and computers understand our natural language by means of speech recognition. Speech (word-by-word) recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech wave using computers. Linguistic information, the most important information in a speech wave, is called phonetic information. The term speech recognition means recognizing the spoken words only; the recognition system has no idea what those words mean. It only knows that they are words and which words they are. To be of any use, the words must be passed on to higher-level software for syntactic and semantic analysis. Speech recognition is a pattern recognition technique in which acoustic signals are tested and framed into phonetics (words, phrases and sentences) [1]. To perform such a task one records a voice sample and converts it into .wav format. Spectrum-based parameters are obtained when a word is recognized; about twenty-four parameters can be obtained from the analysis of the spectrum of a speech signal. These parameters are the mean, median, standard deviation (STD), root mean square (RMS), maximum peak, minimum peak, slope of the maximum peak, width of the maximum peak, signal-to-noise ratio, peak frequency, peak amplitude, total power, total harmonic distortion (THD), THD+noise, intermodulation distortion (IMD), etc. Various statistical methods are used for the analysis of words and give each word specific values; a word's parameters fluctuate within a bounded range. One of the important tasks in improving the word recognition process is to find the most informative parameters of the speech signal. Techniques used for this purpose include Linear Predictive Coding coefficients (LPC) and Mel Frequency Cepstral Coefficients (MFCC) [2]. Using such techniques, a new spectrum is obtained that differs from the original spectrum of the spoken words.

2. DESIGN AND DEVELOPMENT OF OUR SPEECH RECOGNITION SYSTEM
The development of our speech recognition system is divided into two stages: a training stage and a testing stage.

[Figure 2.1, flowchart. Training stage: record speech samples, extract MFCC features and store them, then train the system. Testing stage: take real-time speech input, divide the speech sample into frames, calculate the energy of each frame, analyze the frame energies to separate words, then extract features and classify against the prerecorded database to produce real-time text output.]

Figure 2.1: Speech recognition process

3. TRAINING STAGE
In this stage a database is created from speech samples recorded by the user. The recorded speech samples are stored in .wav format in MATLAB. After this stage, the speech recognition system must be trained.

3.1 Feature Extraction
Feature extraction converts the speech waveform into parametric information for further analysis and processing; this is often referred to as the signal-processing front end. The speech signal is a slowly time-varying signal: when examined over a sufficiently short period of time, its characteristics are fairly stationary, but over longer periods (on the order of 1/5 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal, and for this we use MFCC features.

3.1.1 Mel Frequency Cepstral Coefficients (MFCCs)
We used MFCC features for this system. The 'Mel' in MFCC refers to the melody of a speech signal. MFCC features are based on human-ear perception, which means


the human ear's critical-bandwidth filters are spaced linearly at low frequencies and logarithmically at high frequencies of the speech signal, capturing the useful information of that particular signal. Human perception of the frequency content of speech signals follows a nonlinear scale; that is why pitch is measured on the Mel scale. The Mel-frequency scale is linear below 1000 Hz and logarithmic above 1000 Hz [3]. The Mel frequency is calculated as:

Mel(f) = 2595 * log10(1 + f / 700)    (1)
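Equation (1) can be checked with a few lines of Python (a sketch of the same mapping, not taken from the paper's MATLAB code; the inverse function is an addition, useful when placing Mel-spaced filterbank edges):

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the Mel scale, Eq. (1)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of Eq. (1): Mel value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Note that hz_to_mel(1000) is approximately 1000, matching the stated behavior: near-linear spacing below 1000 Hz, logarithmic above.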

3.2 CREATING THE DATABASE
To recognize the word uttered by the speaker, a database is created that represents each pronounced word. To create this database, we first recorded the numerals one to nine and obtained the following spectra:
Figure 3.2.1: Spectrum of 'one'

Figure 3.2.2: Spectrum of 'two'

Figure 3.2.3: Spectrum of 'three'

Figure 3.2.4: Spectrum of 'four'

Figure 3.2.5: Spectrum of 'five'
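The record-and-store step of the training stage (Section 3) can be sketched in Python; the paper used MATLAB, so this is an equivalent stdlib-only sketch, and the 8 kHz rate (from Section 3.3), the filename, and the synthetic tone standing in for a microphone capture are illustrative assumptions:

```python
import math
import struct
import wave

FS = 8000  # sampling frequency used in the paper's training stage (8 kHz)

def save_wav(path, samples, fs=FS):
    """Store a mono speech sample as a 16-bit PCM .wav file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(fs)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        w.writeframes(frames)

# Stand-in for one recorded numeral: a 0.5 s, 300 Hz tone.
tone = [0.3 * math.sin(2 * math.pi * 300 * n / FS) for n in range(FS // 2)]
save_wav("one.wav", tone)
```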


Figure 3.2.6: Spectrum of 'six'

Figure 3.2.7: Spectrum of 'seven'

Figure 3.2.8: Spectrum of 'eight'

Figure 3.2.9: Spectrum of 'nine'

3.3 TRAINING OF THE VOICE SAMPLES
The speech recognition system must be trained before use; training on the speech samples is a necessary part of the system. We trained our speech samples at a sampling frequency of 8 kHz, and the duration of the training can be varied from 20 s. After the training of the speech samples, the system separates the frames of the speech signal with high energy from those with low energy. Figure 3.3 shows the training sequence of the speech samples:

Figure 3.3: Training of the speech samples

4. EXPERIMENTAL TESTING
Our speech recognition system is speaker dependent, so it responds to the user's voice only. In the training of this system we created a database of nine words. After training, a real-time speech input was given to the system through a good-quality microphone. The system divided the real-time speech sample into small frames, i.e. contiguous groups of samples. The energy of each frame was then calculated using the simple energy formula:

Ex = Σn |x(n)|²    (2)

The calculated energy was then analyzed by a speech detection algorithm to separate the words.
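The framing and per-frame energy of Equation (2) can be sketched as follows (a stdlib-only Python sketch, not the paper's MATLAB code; the 160-sample frame length is taken from Section 4.1 and corresponds to 20 ms at 8 kHz):

```python
def split_frames(x, frame_len=160):
    """Divide a speech sample into contiguous, non-overlapping
    frames of frame_len samples (160 samples = 20 ms at 8 kHz)."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def frame_energy(frame):
    """Eq. (2): Ex = sum over the frame of |x(n)|^2."""
    return sum(s * s for s in frame)

# Example: a silent frame has zero energy, a loud one does not.
signal = [0.0] * 160 + [0.5, -0.5] * 80
energies = [frame_energy(f) for f in split_frames(signal)]
```

Thresholding these per-frame energies is what lets the detection stage separate words from the surrounding silence.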


4.1 SPEECH DETECTION ALGORITHM
The speech detection algorithm processes the prerecorded speech samples frame by frame within a simple loop. We divided the signal into segments of 160 samples, and each segment was examined by the system. For the detection of each frame we used a combination of signal energy and zero-crossing rate. This calculation is straightforward with MATLAB's mathematical and logical operators.

4.2 ACOUSTICAL MODEL
It is very important to create an acoustical model for the detection of each uttered word, so we created one. It is known that different sounds are produced by the human vocal cords and can have different frequencies. The power spectral density is a suitable measure for predicting these frequencies, so we found the frequencies by power spectral density measurements. Since speech can be treated as short-term stationary, MFCC features were again extracted and the words pronounced by the user were detected.

5. RESULTS
Real-time results were obtained in the lab. The user spoke through the microphone and the text representation appeared on the computer screen, as shown in Figure 5.1. Implementation results of the speech-to-text conversion system are as follows:
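The frame-wise detection of Section 4.1, combining energy with zero-crossing rate, can be sketched in Python (the paper implemented this in MATLAB; the threshold values here are illustrative assumptions, not values from the paper):

```python
import math

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a < 0) != (b < 0))
    return crossings / max(len(frame) - 1, 1)

def is_speech(frame, energy_thresh=1.0, zcr_thresh=0.25):
    """Label a 160-sample frame as speech when its energy is high
    and its zero-crossing rate is below a noise-like threshold."""
    energy = sum(s * s for s in frame)
    return energy > energy_thresh and zero_crossing_rate(frame) < zcr_thresh

# Voiced-like frame: large amplitude, few sign changes per 20 ms.
voiced = [0.8 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
silence = [0.001] * 160
```

Combining the two measures helps reject low-energy noise (fails the energy test) as well as high-frequency hiss (fails the zero-crossing test).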

Figure 5.1: STT Conversion of Four and Seven

Figure 5.2: STT Conversion of Eight and Nine


Figure 5.3: STT Conversion of Seven and Eight

6. CONCLUSION
In this project nine words were collected and analyzed. Words were distinguished by the energies associated with them, and the system was able to separate the words according to those energies, with the final output produced as text. Using this code, the system can be trained for more words and paragraphs. Every word parameter varies within a bounded range, and each word has a specific range of these parameters. Some words differ yet still share parameters, e.g. 'seven' and 'one': the word 'seven' contains 'one' at its end, so the two sometimes sound alike and the system occasionally outputs 'one' when 'seven' is pronounced. Such ambiguities can be reduced by taking a large number of samples for each particular word.

The system is also very sensitive to noise; this can be addressed in future work. It is likewise very sensitive to word pronunciation during training: the words recorded to create the database and the words spoken later should be pronounced similarly, and the system is sensitive to the tone of pronunciation.

7. REFERENCES
[1] J. D. Tardelli, C. M. Walter, "Speech waveform analysis and recognition process based on non-Euclidean error minimization and matrix array processing techniques", IEEE ICASSP, pp. 1237-1240, 1986.
[2] Takao Suzuki, Yasuo Shoji, "A new speech processing scheme for ATM switching systems", IEEE, Digital Communications Laboratories, Oki Electric Industry Co. Ltd., Japan, pp. 1515-1519, 1989.
[3] Siva Prasad Nandyala, T. Kishore Kumar, "Real Time Isolated Word Speech Recognition System for Human Computer Interaction", International Journal of Computer Applications, Volume 12, November 2010.
[4] Jeong, S., Hahn, M., "Speech quality and recognition rate improvement in car noise environments", Electronics Letters, 37(12), pp. 800-802, 2001.
[5] Ma, J., Deng, L., "Efficient decoding strategies for conversational speech recognition using a constrained nonlinear state-space model", IEEE Trans. Speech Audio Process., 11(6), pp. 590-602, 2003.
[6] Rohit Ranchal, Teresa Taber-Doughty, Yiren Guo, Keith Bain, Heather Martin, J. Paul Robinson, and Bradley S. Duerstock, "Using Speech Recognition for Real-Time Captioning and Lecture Transcription in the Classroom", IEEE Transactions on Learning Technologies, Vol. 6, No. 4, October-December 2013.
[7] Daryl Ning, "Developing an isolated word recognition system in MATLAB", The MathWorks, Inc., 2009.
[8] Deepak Baby, Tuomas Virtanen, Jort F. Gemmeke, Hugo Van hamme, "Coupled Dictionaries for Exemplar-Based Speech Enhancement and Automatic Speech Recognition", IEEE/ACM Trans. on Audio, Speech and Language Processing, Vol. 23, No. 11, 2015.
[9] Naoki Hirayama, Koichiro Yoshino, Katsutoshi Itoyama, Shinsuke Mori, and Hiroshi G. Okuno, "Automatic Speech Recognition for Mixed Dialect Utterances by Mixing Dialect Language Models", IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 23, No. 2, February 2015.
[10] Shaila D. Apte, "Speech and Audio Processing", Wiley India, 2013.

