Speech
CERTIFICATE
TABLE OF CONTENTS
Chapter
1. PROJECT OVERVIEW
  1.1 Project Objective
  1.2 Abstract
  1.3 Project scope
2. LITERATURE REVIEW
  2.1 An overview of Speech Recognition
  2.2 History
  2.3 Types of speech recognition
    2.3.1 Isolated Speech
    2.3.2 Connected Speech
    2.3.3 Continuous Speech
    2.3.4 Spontaneous Speech
  2.4 Speech Recognition Process
    2.4.1 Components of Speech recognition System
  2.5 Uses of Speech Recognition Programs
  2.6 Applications
    2.6.1 From medical perspective
    2.6.2 From military perspective
    2.6.3 From educational perspective
  2.7 Speech Recognition weakness and flaws
  2.8 The future of speech recognition
  2.9 Few Speech recognition softwares
    2.9.1 XVoice
    2.9.2 ISIP
    2.9.3 Ears
    2.9.4 CMU Sphinx
    2.9.5 NICO ANN Toolkit
3. METHODOLOGY AND TOOLS
  3.1 Fundamentals to speech recognition
    3.1.1 Utterances
    3.1.2 Pronunciation
    3.1.3 Grammar
    3.1.4 Accuracy
    3.1.5 Vocabularies
    3.1.6 Training
  3.2 Tools
  3.3 Methodology
    3.3.1 Speech Synthesis
    3.3.2 Synthesizer as engine
    3.3.3 Selecting Voices
    3.3.4 Speech recognition
  3.4 Use case diagram
  3.5 Activity Diagram
4. IMPLEMENTATION AND TESTING
  4.1 System Requirements
    4.1.1 Minimum requirements
    4.1.2 Best requirements
  4.2 Hardware Requirements
  4.3 Interfaces
  4.4 Working
  4.5 Initial Test, Results and discussion
5. CONCLUSION
  5.1 Advantages of software
  5.2 Disadvantages
  5.3 Future Enhancements
  5.4 Conclusion
REFERENCES
APPENDICES
  Appendix A Gantt chart
  Appendix B Code
LIST OF TABLES
Table 4.1
LIST OF FIGURES
FIG#      TOPIC
Fig 2.1   2.4 Speech recognition process
Fig 3.1   3.3.1 Speech Synthesis
Fig 3.2   3.3 Use Case Diagram
Fig 3.3   3.2 Activity Diagram (Saving the Document)
Fig 3.4   3.2 Activity Diagram (Writing Text)
Fig 3.5   3.2 Activity Diagram (Opening Document)
Fig 3.6   3.2 Activity Diagram (Clearing Document)
Fig 3.7   3.2 Activity Diagram (Opening system Software)
Fig 4.1   4.3 Interfaces (Opening Software)
Fig 4.1   4.3 Interfaces (Opening Software)
Fig 4.2   4.3 Interfaces (Running File Menu)
Fig 4.3   4.3 Interfaces (Saving Document)
Fig 4.4   4.3 Interfaces (Saving Document)
Fig 4.5   4.3 Interfaces (Opening Document)
Fig 4.6   4.3 Interfaces (Opening Document)
Fig 4.7   4.3 Interfaces (Opening Document)
Fig 4.8   4.3 Interfaces (Edit Menu)
Fig 4.9   4.3 Interfaces (Cut Menu)
Fig 4.10  4.3 Interfaces (Copy Menu)
Fig 4.11  4.3 Interfaces (Paste Menu)
Fig 4.12  4.3 Interfaces (Select All)
Fig 4.13  4.3 Interfaces (Font Size)
Fig 4.14  4.3 Interfaces (Font Style)
Fig 4.15  4.3 Interfaces (Running System Commands)
Fig 4.16  4.3 Interfaces (Text through Voice)
Fig 4.17  4.3 Interfaces (Verifying Text)
CHAPTER 1
PROJECT OVERVIEW
This thesis report presents an overview of speech recognition technology, the development of a speech recognition application, and its uses. The first part describes the speech recognition process, its applications in different sectors, its flaws, and the future of the technology. The later part of the report covers the methodology, the code for the software, and its working. Finally, the report concludes with the potential uses of the application and further improvements and considerations.
1.1 Project Objective
To understand speech recognition and its fundamentals.
Its working and applications in different areas.
Its implementation as a desktop application.
Development of software that can mainly be used for:
  Speech recognition
  Speech generation
  Text editing
  Operating the machine through voice
1.2 Abstract
Speech recognition technology is one of the fast-growing engineering technologies. It has a number of applications in different areas and provides potential benefits. Nearly 20% of the world's people suffer from various disabilities; many of them are blind or unable to use their hands effectively. In those cases speech recognition systems provide significant help, so that these users can share information with people by operating a computer through voice input. This project was designed and developed with that factor in mind, and a modest effort is made to achieve this aim. Our project is able to recognize speech and convert the input audio into text; it also enables a user to perform file operations such as save, open, and exit by voice input. It also helps the user to open different system software such as MS Paint, Notepad, and Calculator. At this initial level the effort is to support the basic operations discussed above, but the software can be further updated and enhanced to cover more operations.
1.3 Project scope
This project has speech recognition and speech synthesis capabilities. Although it is not a complete replacement for what we call a NOTEPAD, it is still a good text editor to be used through voice. The software can also open Windows-based software such as Notepad, MS Paint, and more.
CHAPTER 2
LITERATURE REVIEW
2.1 An overview of Speech Recognition
Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone [1] [2]. These words are then recognized by the speech recognizer, and in the end the system outputs the recognized words. The process of speech recognition consists of different steps that are discussed one by one in the following sections. Ideally, a speech recognition engine recognizes all words uttered by a human, but in practice its performance depends on a number of factors. Vocabularies, multiple users, and noisy environments are the major factors on which the performance of a speech recognition engine depends [3].
2.2 History
The concept of speech recognition started somewhere in the 1940s [3]; in practice, the first speech recognition program appeared in 1952 at Bell Labs and recognized a digit in a noise-free environment [4], [5]. The 1940s and 1950s are considered the foundational period of speech recognition technology; in this period work was done on the foundational paradigms of speech recognition, that is, automation and information-theoretic models [15]. In the 1960s, small vocabularies (on the order of 10-100 words) of isolated words could be recognized, based on simple acoustic-phonetic properties of speech sounds [3]. The key technologies developed during this decade were filter banks and time normalization methods [15]. In the 1970s, medium vocabularies (on the order of 100-1000 words) were recognized using simple template-based pattern recognition methods. In the 1980s, large vocabularies (1000 words to unlimited) were used, and speech recognition problems were addressed with statistical methods and a large range of networks for handling language structures. The key inventions of this era were the hidden Markov model (HMM) and the stochastic language model, which together enabled powerful new methods for handling the continuous speech recognition problem efficiently and with high performance [3]. In the 1990s, the key technologies developed were methods for stochastic language understanding, statistical learning of acoustic and language models, and methods for implementing large-vocabulary speech understanding systems. After five decades of research, speech recognition technology has finally entered the marketplace, benefiting users in a variety of ways. The challenge of designing a machine that truly functions like an intelligent human is still a major one going forward.
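2.3 Types of speech recognition
Speech recognition systems can be divided into classes according to their ability to recognize words and the list of words they have. The classes covered in this report are isolated, connected, continuous, and spontaneous speech.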
2.4 Speech Recognition Process
Digitization
The process of converting an analog signal into digital form is known as digitization [8]; it involves both sampling and quantization. Sampling converts a continuous signal into a discrete signal, while quantization approximates a continuous range of values by a finite set of discrete values.
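The two steps of digitization can be illustrated with a small Java sketch. It is not taken from the project code: the class name, the sine-wave test signal, and the 8 kHz / 8-bit parameters are assumptions chosen only to make the idea concrete.

```java
// Illustrative sketch only: sampling and uniform quantization of a sine tone.
// All names and parameter values are assumptions, not part of the project code.
class DigitizationSketch {

    // Sampling: evaluate a continuous signal (a sine wave of the given
    // frequency) at the discrete instants t = n / sampleRate.
    static double[] sample(double frequencyHz, int sampleRate, int count) {
        double[] samples = new double[count];
        for (int n = 0; n < count; n++) {
            double t = (double) n / sampleRate;
            samples[n] = Math.sin(2 * Math.PI * frequencyHz * t);
        }
        return samples;
    }

    // Quantization: map an amplitude in [-1.0, 1.0] to one of 2^bits integer levels.
    static int quantize(double amplitude, int bits) {
        int levels = 1 << bits;
        double clamped = Math.max(-1.0, Math.min(1.0, amplitude));
        int level = (int) Math.floor((clamped + 1.0) / 2.0 * levels);
        return Math.min(level, levels - 1); // amplitude 1.0 falls into the top level
    }

    public static void main(String[] args) {
        // Eight samples of a 440 Hz tone at 8 kHz, quantized to 8 bits.
        for (double s : sample(440.0, 8000, 8)) {
            System.out.println(s + " -> level " + quantize(s, 8));
        }
    }
}
```

A higher sample rate captures faster changes in the signal, and more bits per sample reduce the rounding error introduced by quantization.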
Acoustic Model
An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech [8]. The software's acoustic model breaks words into phonemes [10].
Language Model
Language modeling is used in many natural language processing applications such as speech recognition; it tries to capture the properties of a language and to predict the next word in a speech sequence [8]. The software's language model compares the phonemes to words in its built-in dictionary [10].
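One simple way to predict the next word is a bigram model, which counts how often each word follows another in training text. The Java sketch below is only an illustration of that idea; the report does not state which language model the software uses, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative bigram language model (an assumption, not the project's model).
class BigramSketch {
    // counts.get(w1).get(w2) = number of times w2 followed w1 in the training text
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    // Count every adjacent word pair in the given text.
    void train(String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            Map<String, Integer> followers =
                    counts.computeIfAbsent(words[i], k -> new HashMap<>());
            followers.merge(words[i + 1], 1, Integer::sum);
        }
    }

    // Return the most frequent follower of the word, or null if the word was never seen.
    String predictNext(String word) {
        Map<String, Integer> followers = counts.get(word.toLowerCase());
        if (followers == null) {
            return null;
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : followers.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```

A recognizer could use such counts to prefer, among several similar-sounding candidates, the word that most often follows the previous one.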
Speech engine
The job of the speech recognition engine is to convert the input audio into text [4]; to accomplish this it uses all sorts of data, software algorithms, and statistics. Its first operation is digitization, as discussed earlier, which converts the audio into a suitable format for further processing. Once the audio signal is in the proper format, the engine searches for the best match for it. It does this by considering the words it knows; once the signal is recognized, it returns the corresponding text string.
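The "best match" search can be pictured as a nearest-neighbour lookup. The Java sketch below assumes, purely for illustration, that each known word is stored as one fixed-length feature vector and that the closest vector wins; real engines use far richer models, and none of these names come from the project code.

```java
import java.util.Map;

// Illustrative best-match search over stored word templates (hypothetical design).
class BestMatchSketch {

    // Squared Euclidean distance between two feature vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Return the known word whose template is closest to the input features,
    // or null if no templates are known.
    static String recognize(double[] features, Map<String, double[]> templates) {
        String bestWord = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : templates.entrySet()) {
            double distance = squaredDistance(features, e.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                bestWord = e.getKey();
            }
        }
        return bestWord;
    }
}
```

The returned key plays the role of the text string that the engine hands back once the signal is recognized.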
2.5 Uses of Speech Recognition Programs
Speech recognition programs have two basic uses. The first is dictation, which in the context of speech recognition is the translation of spoken words into text. The second is controlling the computer, that is, developing software capable of letting a user operate different applications by voice [4][11]. Writing by voice lets a person write 150 words per minute or more, if he or she can indeed speak that quickly. This aspect of speech recognition programs creates an easy way of composing text and helps people in that industry to compose millions of words digitally in a short time rather than writing them one by one, saving time and effort. Speech recognition is an alternative to the keyboard. If you are unable to write or just don't want to type, speech recognition programs help you do almost anything that you used to do with a keyboard.
2.6 Applications
Telephony
Some Voice Mail systems allow callers to speak commands instead of pressing buttons to send specific tones.
Medical/Disabilities
Many people have difficulty typing due to physical limitations such as repetitive strain injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty hearing could use a system connected to their telephone to convert the caller's speech to text.
2.7 Speech Recognition weakness and flaws
A fully accurate speech recognition system has yet to be developed. A number of factors can reduce the accuracy and performance of a speech recognition program.
Speech recognition is easy for a human but difficult for a machine. Compared with the human mind, speech recognition programs seem less intelligent. This is because the human mind is a God-given gift, and its capability of thinking, understanding, and reacting is natural, while for a computer program it is a complicated task: first it needs to understand the spoken words with respect to their meanings, and then it has to strike a sufficient balance between the words, noise, and pauses. A human has a built-in capability of filtering noise out of speech, while a machine requires training; the computer needs help separating the speech sound from other sounds.
A few factors that are considerable in this regard are [10]:
Homonyms: words that are spelled differently and have different meanings but sound the same, for example "there" and "their", or "be" and "bee". It is a challenge for a machine to distinguish between such words that sound alike.
Noise factor: the program needs to hear the words uttered by a human distinctly and clearly. Any extra sound can create interference; you first need to place the system away from noisy environments and then speak clearly, or else the machine will become confused and mix up the words.
2.8 The future of speech recognition
Greater use will be made of intelligent systems which will attempt to guess what the speaker intended to say, rather than what was actually said, as people often misspeak and make unintentional mistakes. Microphone and sound systems will be designed to adapt more quickly to changing background noise levels, different environments, with better recognition of extraneous material to be discarded.
2.9.2 ISIP
The Institute for Signal and Information Processing at Mississippi State University has made its speech recognition engine available. The toolkit includes a frontend, a decoder, and a training module. It's a functional toolkit. This software is primarily for developers. The toolkit (and more information about ISIP) is available at: http://www.isip.msstate.edu/project/speech
2.9.3 Ears
Although Ears isn't fully developed, it is a good starting point for programmers wishing to start in ASR. This software is primarily for developers.
CHAPTER 3
METHODOLOGY AND TOOLS
3.1 Fundamentals to speech recognition
Speech recognition is basically the science of talking to the computer and having it correctly recognize what is said [17]. To elaborate on it we have to understand the following terms [4], [13].
3.1.1 Utterances
When the user says something, this is an utterance [13]; in other words, speaking a word or a combination of words that means something to the computer is called an utterance. Utterances are then sent to the speech engine to be processed.
3.1.2 Pronunciation
One piece of information that a speech recognition engine uses to process a word is its pronunciation, which represents what the speech engine thinks the word should sound like [4]. A word can have multiple pronunciations associated with it.
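A pronunciation dictionary with several entries per word can be sketched as a map from a word to a list of phoneme strings. The Java class below is a hypothetical illustration; the phoneme notation in the usage note is only an example and is not taken from the project's dictionary file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative pronunciation lexicon (hypothetical names and data layout).
class LexiconSketch {
    private final Map<String, List<String>> pronunciations = new HashMap<>();

    // Register one more pronunciation for a word.
    void add(String word, String phonemes) {
        pronunciations.computeIfAbsent(word, k -> new ArrayList<>()).add(phonemes);
    }

    // All pronunciations known for a word; an empty list if the word is unknown.
    List<String> pronunciationsOf(String word) {
        List<String> found = pronunciations.get(word);
        return found == null ? Collections.<String>emptyList() : found;
    }
}
```

For instance, calling add("read", "R IY D") and add("read", "R EH D") stores two pronunciations for the same spelling, so pronunciationsOf("read") returns both.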
3.1.3 Grammar
A grammar uses a particular set of rules to define the words and phrases that are going to be recognized by the speech engine; more concisely, the grammar defines the domain within which the speech engine works [4]. A grammar can be as simple as a list of words or flexible enough to support various degrees of variation.
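As a rough illustration of a grammar, the Java sketch below accepts either a single action word or an action followed by a target. The action words open, save, and clear are the notepad commands named in this report; the target words and the two-word rule are assumptions made only for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative command grammar: "<action>" or "<action> <target>".
// The target words and the rule shape are hypothetical.
class GrammarSketch {
    private final Set<String> actions = new HashSet<>(Arrays.asList("open", "save", "clear"));
    private final Set<String> targets = new HashSet<>(Arrays.asList("file", "document"));

    // True if the utterance fits one of the two rules of this grammar.
    boolean accepts(String utterance) {
        String[] words = utterance.trim().toLowerCase().split("\\s+");
        if (words.length == 1) {
            return actions.contains(words[0]);
        }
        if (words.length == 2) {
            return actions.contains(words[0]) && targets.contains(words[1]);
        }
        return false;
    }
}
```

Restricting the engine to such a small domain is what makes command recognition easier than free dictation.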
3.1.4 Accuracy
The performance of a speech recognition system is measurable [4]; the ability of a recognizer can be measured by calculating its accuracy, that is, how well it identifies utterances.
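The report does not say how accuracy was calculated. One widely used measure is the word error rate: the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference text, divided by the number of reference words. The Java sketch below computes it with a standard edit-distance table; the class and method names are hypothetical.

```java
// Illustrative word error rate calculation (a common metric, assumed here;
// the report does not state which accuracy measure it used).
class AccuracySketch {

    // Minimum number of word substitutions, insertions, and deletions
    // needed to turn the hypothesis into the reference.
    static int wordEditDistance(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= hyp.length; j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int cost = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }
        return d[ref.length][hyp.length];
    }

    // Word error rate = edit distance / number of reference words.
    static double wordErrorRate(String reference, String hypothesis) {
        String[] ref = reference.trim().toLowerCase().split("\\s+");
        String[] hyp = hypothesis.trim().toLowerCase().split("\\s+");
        return (double) wordEditDistance(ref, hyp) / ref.length;
    }
}
```

With this measure, one wrong word in a three-word reference gives an error rate of one third, and accuracy can be reported as one minus the error rate.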
3.1.5 Vocabularies
Vocabularies are the lists of words that can be recognized by the speech recognition engine [4]. Generally, smaller vocabularies are easier for a speech recognition engine to identify, while a large list of words is a more difficult task for the engine.
3.1.6 Training
Training can be used by users who have difficulty speaking or pronouncing certain words; speech recognition systems with training should be able to adapt.
3.2 Tools
1. Smartdraw2000 (for drawing the Gantt chart and Speech Recognition Model)
2. Visual Paradigm for UML 7.1 (for Use case and Activity Diagram)
3. Ms-Paint
4. Notepad
5. Command Prompt
6. Java development kit 1.6
7. Office 2007 (Documentation)
3.3 Methodology
Speech recognition is an emerging technology, and not all developers are familiar with it. While the basic functions of both speech synthesis and speech recognition take only a few minutes to understand (after all, most people learn to speak and listen by age two), there are subtle and powerful capabilities provided by computerized speech that developers will want to understand and utilize. Despite very substantial investment in speech technology research over the last 40 years, speech synthesis and speech recognition technologies still have significant limitations. Most importantly, speech technology does not always meet the high expectations of users familiar with natural human-to-human speech communication. Understanding the limitations - as well as the strengths - is important for effective use of speech input and output in a user interface and for understanding some of the advanced features of the Java Speech API. An understanding of the capabilities and limitations of speech technology is also important for developers in making decisions about whether a particular application will benefit from the use of speech input and output.
Text pre-processing: analyze the input text for special constructs of the language. In English, special treatment is required for abbreviations, acronyms, dates, times, numbers, currency amounts, email addresses and many other forms. Other languages need special processing for these forms and most languages have other specialized requirements.
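A very small part of text pre-processing can be sketched in Java as follows. The abbreviation table and the digit-by-digit reading of numbers are simplifying assumptions made for this example; a real synthesizer handles many more forms and reads numbers in context.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative text pre-processing: expands a few abbreviations and spells
// out digit strings. The table contents and class name are hypothetical.
class TextNormalizerSketch {
    private static final Map<String, String> ABBREVIATIONS = new HashMap<>();
    static {
        ABBREVIATIONS.put("Dr.", "Doctor");
        ABBREVIATIONS.put("St.", "Street");
        ABBREVIATIONS.put("etc.", "et cetera");
    }

    private static final String[] DIGIT_NAMES = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
    };

    // Read a string of ASCII digits one digit at a time, e.g. "42" -> "four two".
    static String spellDigits(String digits) {
        StringBuilder out = new StringBuilder();
        for (char c : digits.toCharArray()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(DIGIT_NAMES[c - '0']);
        }
        return out.toString();
    }

    // Replace each whitespace-separated token by its spoken form.
    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            String spoken;
            if (ABBREVIATIONS.containsKey(token)) {
                spoken = ABBREVIATIONS.get(token);
            } else if (token.matches("[0-9]+")) {
                spoken = spellDigits(token);
            } else {
                spoken = token;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(spoken);
        }
        return out.toString();
    }
}
```

Under these assumptions, the input "Dr. Khan lives at 42 Main St." becomes "Doctor Khan lives at four two Main Street" before it is passed on for speech output.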
Allocate and Resume: the allocate and resume methods prepare the Synthesizer to produce speech by allocating all required resources and putting it in the RESUMED state.
Deallocate: The waitEngineState method blocks the caller until the Synthesizer is in the QUEUE_EMPTY state - until it has finished speaking the text. The deallocate method frees the synthesizer's resources.
Synthesizers are searched, selected and created through the Central class in the javax.speech package.
Synthesizers inherit the basic state system of an engine from the Engine interface. The basic engine states are ALLOCATED, DEALLOCATED, ALLOCATING_RESOURCES and DEALLOCATING_RESOURCES for allocation state, and PAUSED and RESUMED for audio output state. The getEngineState method and other methods are inherited for monitoring engine state. An EngineEvent indicates state changes. The javax.speech.synthesis package also extends the EngineListener interface as SynthesizerListener to provide events that are specific to synthesizers.
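The state system described above can be modelled with a small self-contained Java class. This is a simplified stand-in, not the Java Speech API itself: the class, its methods, the event list, and the choice to start in the PAUSED output state are assumptions made for illustration, and only the state names come from the text.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the engine state system (not the real javax.speech classes).
class EngineStateSketch {
    enum Allocation { DEALLOCATED, ALLOCATING_RESOURCES, ALLOCATED, DEALLOCATING_RESOURCES }
    enum Output { PAUSED, RESUMED }

    private Allocation allocation = Allocation.DEALLOCATED;
    private Output output = Output.PAUSED; // assumed initial output state
    private final List<String> events = new ArrayList<>(); // stands in for EngineEvent delivery

    private void moveTo(Allocation next) {
        allocation = next;
        events.add(next.name());
    }

    // DEALLOCATED -> ALLOCATING_RESOURCES -> ALLOCATED
    void allocate() {
        if (allocation != Allocation.DEALLOCATED) {
            return;
        }
        moveTo(Allocation.ALLOCATING_RESOURCES);
        moveTo(Allocation.ALLOCATED);
    }

    // ALLOCATED -> DEALLOCATING_RESOURCES -> DEALLOCATED
    void deallocate() {
        if (allocation != Allocation.ALLOCATED) {
            return;
        }
        moveTo(Allocation.DEALLOCATING_RESOURCES);
        moveTo(Allocation.DEALLOCATED);
    }

    void pause() {
        setOutput(Output.PAUSED);
    }

    void resume() {
        setOutput(Output.RESUMED);
    }

    // In this model the audio output state may only change while the engine is allocated.
    private void setOutput(Output next) {
        if (allocation != Allocation.ALLOCATED) {
            throw new IllegalStateException("engine is not allocated");
        }
        output = next;
        events.add(next.name());
    }

    Allocation allocation() { return allocation; }
    Output output() { return output; }
    List<String> events() { return events; }
}
```

Calling allocate(), resume(), and deallocate() in turn records the events ALLOCATING_RESOURCES, ALLOCATED, RESUMED, DEALLOCATING_RESOURCES, and DEALLOCATED, which mirrors the order in which a listener would be told about state changes in this model.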
The age of a voice can be AGE_CHILD (up to 12 years), AGE_TEENAGER (13-19), AGE_YOUNGER_ADULT (20-40), AGE_MIDDLE_ADULT (40-60), AGE_OLDER_ADULT (60+), AGE_NEUTRAL, and AGE_DONT_CARE. Both gender and age are OR'able values for both applications and engines. For example, an engine could specify a voice as:
Voice("name", GENDER_MALE, AGE_CHILD | AGE_TEENAGER, "style");
In the same way that mode descriptors are used by engines to describe themselves and by applications to select from amongst available engines, the Voice class is used both for description and selection. The match method of Voice allows an application to test whether an engine-provided voice has suitable properties.
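What "OR'able" means can be shown with a self-contained Java sketch. The constant names follow the text, but the bit values and the matching rule below are assumptions for illustration only; the real values and matching behaviour are defined by the Voice class of the Java Speech API.

```java
// Illustrative bit-flag matching for voice ages. The numeric values and the
// ageMatches rule are hypothetical, not those of javax.speech.synthesis.Voice.
class VoiceMatchSketch {
    static final int AGE_CHILD = 1;
    static final int AGE_TEENAGER = 1 << 1;
    static final int AGE_YOUNGER_ADULT = 1 << 2;
    static final int AGE_MIDDLE_ADULT = 1 << 3;
    static final int AGE_OLDER_ADULT = 1 << 4;
    static final int AGE_NEUTRAL = 1 << 5;
    static final int AGE_DONT_CARE = 0xFFFF; // every bit set, so it overlaps any age

    // An offered age set matches a required one if they share at least one bit.
    static boolean ageMatches(int offered, int required) {
        return (offered & required) != 0;
    }
}
```

Because each age occupies its own bit, a voice described as AGE_CHILD | AGE_TEENAGER satisfies a request for AGE_TEENAGER but not one for AGE_OLDER_ADULT.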
ocvolume(java.lang.String dict, java.lang.String folder)
Constructor that creates a speech recognition engine using VQ for recognition.
Parameters:
dict - file path of the dictionary file that contains all the words that the engine can recognize
folder - path of the folder where the *.vq files are located
After that we have to create an object of the mic input class using the following constructor:
micInput():
Constructor used to create the mic object.
Afterwards, the following methods can be used to recognize the spoken words:
byteArrayComplete():
Returns true if a word is stored in the buffer
removeOldWord():
Removes the first element in the buffer
newWord():
Reads the next element in the word buffer
run():
Starts recording from the microphone
stopRecord():
Stops the recording
setContinuous():
Sets the recording method to continuous
setDiscrete():
Sets the recording method to discrete
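The way the three buffer methods fit together can be sketched with a stand-in class. The class below is not the real micInput class: it only imitates the behaviour described above with an in-memory queue, the push method is a hypothetical helper that plays the role of the recording thread, and the byte[] element type is an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Stand-in for the word buffer described above (hypothetical; not the real micInput class).
class WordBufferSketch {
    private final Deque<byte[]> buffer = new ArrayDeque<>();

    // Hypothetical helper: simulates the recorder storing one finished word.
    void push(byte[] wordAudio) {
        buffer.addLast(wordAudio);
    }

    // True if at least one word is stored in the buffer.
    boolean byteArrayComplete() {
        return !buffer.isEmpty();
    }

    // Reads the next word without removing it.
    byte[] newWord() {
        return buffer.peekFirst();
    }

    // Removes the first element in the buffer.
    void removeOldWord() {
        buffer.pollFirst();
    }

    // Typical polling loop: read each stored word, then discard it.
    static List<byte[]> drain(WordBufferSketch mic) {
        List<byte[]> words = new ArrayList<>();
        while (mic.byteArrayComplete()) {
            words.add(mic.newWord());
            mic.removeOldWord();
        }
        return words;
    }
}
```

In a real application each word read in the loop would be handed to the recognition engine, and access to the buffer would need to be synchronized with the recording thread.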
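3.4 Use case diagram
[Fig 3.2: Use Case Diagram]
3.5 Activity Diagram
[Fig 3.3: Activity Diagram (Saving the Document)]
[Fig 3.4: Activity Diagram (Writing Text)]
[Fig 3.5: Activity Diagram (Opening Document)]
[Fig 3.6: Activity Diagram (Clearing Document)]
[Fig 3.7: Activity Diagram (Opening system Software)]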
CHAPTER 4
IMPLEMENTATION AND TESTING
4.1 System requirements
4.1.1 Minimum requirements
Pentium 200 MHz processor
64 MB of RAM
Microphone
Sound card
Microphones
A quality microphone is key when using a speech recognition system. Desktop microphones are not suitable for speech recognition because they tend to pick up more ambient noise. The best and most common choice is the headset style. It minimizes ambient noise while keeping the microphone at the tip of your tongue all the time. Headsets are available with or without earphones (mono or stereo).
Computer/ Processors
Speech recognition applications can be heavily dependent on processing speed. This is because a large amount of digital filtering and signal processing can take place in ASR.
4.3 Interfaces
Font Size
Sets the selected/modified size of the text
Font Style
Sets the user specified font style of text
After enabling this option the software is able to record human speech, convert it into text, and output it in written form based on identification of the input speech.
4.4 Working
This software is designed to recognize speech and also has speech synthesis capabilities; that is, it can convert speech to text and text to speech. The software, named SUNTA BOLTA NOTEPAD, can write spoken words into the text area of the notepad and can also recognize commands such as save, open, and clear. It is also capable of opening Windows software such as Notepad, MS Paint, and Calculator through voice input. The synthesis part of the software helps verify the operations performed by the user: it reads out the written text and announces the action the user is performing, such as saving a document, opening a new file, or opening a file previously saved on the hard disk.
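The behaviour described above, where a recognized word is either a command or text to be written, can be sketched as a lookup table of actions. The Java class below is a hypothetical illustration and not the code of SUNTA BOLTA NOTEPAD: the text area is a plain StringBuilder, and the announcements list stands in for the spoken confirmations produced by the synthesizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative dispatch of recognized words (hypothetical design and names).
class CommandDispatchSketch {
    private final Map<String, Runnable> commands = new HashMap<>();
    private final StringBuilder textArea = new StringBuilder();
    private final List<String> announcements = new ArrayList<>(); // stands in for speech output

    CommandDispatchSketch() {
        commands.put("save", () -> announcements.add("saving document"));
        commands.put("open", () -> announcements.add("opening document"));
        commands.put("clear", () -> {
            textArea.setLength(0);
            announcements.add("clearing document");
        });
    }

    // A recognized command word runs its action; any other word is written as text.
    void onRecognized(String word) {
        Runnable action = commands.get(word.toLowerCase());
        if (action != null) {
            action.run();
            return;
        }
        if (textArea.length() > 0) {
            textArea.append(' ');
        }
        textArea.append(word);
    }

    String text() { return textArea.toString(); }
    List<String> announcements() { return announcements; }
}
```

One consequence of this simple design is that a command word such as "save" can never be dictated as ordinary text; a fuller design would need a separate dictation mode or a spoken prefix for commands.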
(II) Voice
While providing voice input, the software recognized the spoken words in a few attempts; this is due to the noisy environment, variation in the voice, and the multiple-user factor.

(I) Mouse/Keyboard
The notepad commands were run through mouse input and worked properly. Proper functioning of open, save and clear on a file was observed.

(II) Voice
Voice input was provided to run the notepad commands and they worked according to expectations. Proper functioning of open, save and clear on a file was observed.

(I) Voice
Voice input to the software for running system commands worked as expected, without repeating the commands twice or thrice.
CHAPTER 5
CONCLUSION
5.1 Advantages of software
Able to write text through both keyboard and voice input.
Voice recognition of different notepad commands such as open, save and clear.
Opens different Windows software based on voice input.
Requires less time for writing text.
Provides significant help for people with disabilities.
Lower operational costs.
5.2 Disadvantages
Low accuracy
Not good in a noisy environment
5.4 Conclusion
This thesis/project work on speech recognition started with a brief introduction to the technology and its applications in different sectors. The project part of the report was based on software development for speech recognition. At a later stage we discussed different tools for bringing that idea into practice. After the development of the software, it was tested and the results were discussed, and a few deficiencies were brought to light. After the testing work, the advantages of the software were described and suggestions for further enhancement and improvement were discussed.
REFERENCES
BOOKS
[1] Speech recognition - The next revolution, 5th edition.
[4] "Fundamentals of Speech Recognition". L. Rabiner & B. Juang. 1993. ISBN: 0130151572.
[5] "Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition". D. Jurafsky, J. Martin. 2000. ISBN: 0130950696.
[15] "Automatic Speech Recognition - A Brief History of the Technology Development". B.H. Juang & Lawrence R. Rabiner. 10/08/2004. Source: http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-final10-8.pdf
[13] "Speech Recognition HOWTO". Stephen Cook. Revision v2.0, April 19, 2002. Source: http://www.scribd.com/doc/2586608/speechrecognitionhowto
INTERNET
[3] http://www.abilityhub.com/speech/speech-description.htm
[7] "Speech Recognition". Charu Joshi. Source: http://www.scribd.com/doc/2586608/speechrecognition.pdf Date added: 04/21/2008
[10] http://www.jisc.ac.uk/media/documents/techwatch/ruchi.pdf
[11] http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition3.ht
APPENDICES