Speech Recognition
Voice is the original “user interface.” Most of us grew up learning how to communicate
this way, making it the preferred method of communication because it is what we are
most comfortable with. Now, we are finally circling back to voice after years of emails,
IMs, texting, type-to-search, etc. For all of modern technology’s advancements, voice
control has been a rather unsophisticated affair. What supposedly aims at simplifying
our lives instead has historically been frustratingly clunky and nothing more than a
Speech recognition is defined as the ability of a machine or program to identify words
and phrases in spoken language and convert them to a machine-readable format.
It can also be described as the computer analysis of the human voice, especially for the
purposes of interpreting words and phrases or identifying an individual voice. Its main
purpose is to operate a device, perform commands, or write without having to use a
keyboard or mouse or press any buttons.
Speech recognition is a technology that allows spoken input into systems. You talk to
your computer, phone or device and it uses what you said as input to trigger some
action. The technology is being used to replace other methods of input like typing,
clicking or selecting in other ways. It is a means to make devices and software more
user-friendly and to increase productivity.
There are plenty of applications and areas where speech recognition is used, including
the military, as an aid for impaired persons (imagine a person with limited or no use of
their hands or fingers), in the medical field, in robotics etc. In the near future, nearly
everyone will be exposed to speech recognition due to its propagation among common
devices like computers and mobile phones.
Certain smart phones are making interesting use of speech recognition. The iPhone and
Android devices are examples of that. Through them, you can initiate a call to a contact
by just giving spoken instructions like "Call office." Other commands may also be
entertained, like "Switch on Bluetooth."
Voice recognition technology has grown dramatically over the past five decades. In 1976,
computers could only understand slightly more than 1,000 words. That total jumped to
roughly 20,000 in the 1980s as IBM continued to develop voice recognition technology.
The first speech recognition product for consumers, Dragon Dictate, was launched in
1990 by Dragon. In 1996, IBM introduced the first voice recognition product that could
recognize continuous speech.
The first ever attempt at speech recognition technology was, astoundingly, from around
the year 1000 A.D. A man named Pope Sylvester II invented an instrument with
“magic” that could supposedly answer “yes” or “no” questions. Although the details of
his invention have yet to be discovered, Pope Sylvester II would never have guessed
that 944 years later, we would still be captivated by the wonders of a similar technology
– the Magic 8 Ball.
The first ‘official’ example of our modern speech recognition technology was
“Audrey”, a system designed by Bell Laboratories in the 1950s. A trailblazer in this
field, “Audrey” was able to recognize only 9 digits spoken by a single voice.
Originally, the creation of a functional voice recognition system was pursued so that
secretaries would have less of a burden while taking dictation. So, while “Audrey” was
an important first step, it did little to assist in dictation. The next real advancement took
12 years to develop.
Premiering at the World’s Fair in 1962, IBM’s “Shoebox” was able to recognize and
differentiate between 16 words. Up to this point, speech recognition was still laborious.
The earlier systems were set up to recognize and process bits of sound (‘phonemes’).
IBM engineers programmed the machines to use the sound and pitch of each phoneme
as a ‘clue’ to determine what word was being said. Then, the system would try to match
the sound as closely as it could to the preprogrammed tonal information it had. The
technology was, at the time, quite advanced for what it was. However, users had to
make pauses and speak slowly to ensure the machine would actually pick up what was
being said.
After another nine years, the Department of Defense began to recognize the value of
speech recognition technology. The ability for a computer to process natural human
language could prove invaluable in any number of areas in the military and national
defense. So, they invested five years into DARPA’s Speech Understanding Research
program, one of the largest programs of its kind in the history of speech recognition.
One of the more prominent inventions to come from this research program was called
“Harpy”, a system that was able to recognize over 1,000 words, the vocabulary of an
average toddler. “Harpy” was the most advanced speech recognition software of its time.
In the late 1970s and 1980s, speech recognition systems started to become so ubiquitous
that they were making their way into children’s toys. In 1978, the Speak & Spell, using
a speech chip, was introduced to help children spell out words. The speech chip within
would prove to be an important tool for the next phase in speech recognition software.
In 1987, the World of Wonders “Julie” doll came out. In an impressive (if not downright
terrifying) display, Julie was able to respond to a speaker and had the capacity to
distinguish between speakers’ voices.
The ability to distinguish between speakers was not the only advancement made during
this time. More and more scientists were abandoning the notion that speech recognition
had to be acoustically based. Instead, they moved more towards a linguistic approach.
Instead of just using sounds, scientists turned to algorithms to program systems with
the rules of the English language. So, if you were speaking to a system that had trouble
recognizing a word you said, it would be able to give an educated guess by assessing
its options against correct syntactic, semantic, and tonal rules.
Three short years after Julie, the world was introduced to Dragon, debuting its first
speech recognition system, the “Dragon Dictate”. Around the same time, AT&T was
playing with over-the-phone speech recognition software to help field their customer
service calls.
In 1997, Dragon released “Naturally Speaking,” which allowed for natural speech to be
processed without the need for pauses. What started out as a painfully simple and often
inaccurate system is now easy for customers to use. After the launch of smartphones in
the second half of the 2000s, Google launched its Voice Search app for the iPhone.
Three years later, Apple introduced Siri, which is now a prominent
During this past decade, several other technology leaders have also developed more
sophisticated voice recognition software, with Amazon's Echo featuring Alexa and
Microsoft's Cortana -- both of which act as personal assistants that respond to voice
One year later in 2011, Apple debuted ‘Siri’. ‘She’ became instantly famous for her
incredible ability to accurately process natural utterances and for her ability to
respond using conversational (and often shockingly sassy) language. You’re sure to
have seen a few screen-captures of her pre-programmed humour floating around the
internet. Her success, boosted by zealous Apple fans, brought speech recognition
technology to the forefront of innovation and technology. With the ability to respond
using natural language and to ‘learn’ using cloud-based processing, Siri catalysed the
birth of other like-minded technologies such as Amazon’s Alexa and Microsoft’s
Cortana.
At present there are many voice assistant devices, such as Google Home and Amazon’s
Alexa, that perform many tasks: playing media, controlling TVs and speakers, planning
your day, managing tasks and controlling your home.
The figure shows a block diagram of a typical integrated continuous speech recognition
system. Interestingly enough, this generic block diagram can be made to work on
virtually any speech recognition task that has been devised in the past 40 years, i.e.
isolated word recognition, connected word recognition, continuous speech recognition,
etc. The feature analysis module provides the acoustic feature vectors used to
characterize the spectral properties of the time-varying speech signal. The word level
acoustic match module evaluates the similarity between the input feature vector
sequence (corresponding to a portion of the input speech) and a set of acoustic word
models for all words in the recognition task vocabulary to determine which words were
most likely spoken.
The sentence-level match module uses a language model (i.e., a model of syntax
and semantics) to determine the most likely sequence of words. Syntactic and semantic
rules can be specified, either manually, based on task constraints, or with statistical
models such as word and class N-gram probabilities. Search and recognition decisions
are made by considering all likely word sequences and choosing the one with the
best matching score as the recognized sentence.
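The sentence-level match can be sketched in a few lines. This is only a toy illustration: the bigram probabilities, the acoustic scores and the function names below are made up for the example and are not taken from any real recognizer. Each candidate word sequence gets an acoustic score plus a language-model score, and the sequence with the best total is chosen.

```python
import math

# Hypothetical bigram probabilities (assumed values, for illustration only).
BIGRAM_LOGP = {
    ("<s>", "recognize"): math.log(0.4),
    ("recognize", "speech"): math.log(0.5),
    ("<s>", "wreck"): math.log(0.1),
    ("wreck", "a"): math.log(0.3),
    ("a", "nice"): math.log(0.2),
    ("nice", "beach"): math.log(0.1),
}
UNSEEN_LOGP = math.log(1e-4)  # crude floor for word pairs missing from the table


def sentence_score(words, acoustic_logp):
    """Add the bigram language-model score to the acoustic match score."""
    lm_logp = 0.0
    previous = "<s>"  # sentence-start marker
    for word in words:
        lm_logp += BIGRAM_LOGP.get((previous, word), UNSEEN_LOGP)
        previous = word
    return acoustic_logp + lm_logp


def best_sentence(candidates):
    """candidates: list of (word_list, acoustic_logp); returns the best word list."""
    return max(candidates, key=lambda c: sentence_score(c[0], c[1]))[0]


# Two candidates with similar (assumed) acoustic scores.
candidates = [
    (["recognize", "speech"], -12.0),
    (["wreck", "a", "nice", "beach"], -11.5),
]
print(best_sentence(candidates))
```

Even though the second candidate has a slightly better acoustic score here, the language model tips the decision towards the more plausible word sequence.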
There are a lot of perplexing elements involved in the overall voice recognition process.
However, for easier comprehension, we have jotted down the key aspects of speech
recognition processes.
1. The first step is to convert the analogue signal into a digital signal. When you speak,
you create vibrations which are captured by an analogue-to-digital converter (ADC). It
turns those signals into digital data for further measurement.
2. The software then removes noise from the digitized signal and normalizes the volume
to a constant level.
3. The signal is then separated into minute segments, each of which may last only a few
thousandths of a second.
4. These segments are then matched to phonemes, which combine to form meaningful
expressions. The software then examines each phoneme in the context of those around
it. A complex statistical model compares these phonemes against a massive vocabulary
library. This allows the software to interpret what the speaker wanted to say.
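Steps 2 and 3 above can be sketched as follows. This is a minimal illustration under simple assumptions: the audio is already a list of floating-point samples, "constant volume" is taken to mean peak normalization, and the function names are hypothetical. Noise removal and phoneme matching are not shown.

```python
def normalize_peak(samples, target_peak=1.0):
    """Scale the samples so that the loudest one has magnitude target_peak."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    return [s * target_peak / peak for s in samples]


def split_segments(samples, sample_rate, segment_ms):
    """Cut the signal into consecutive segments of segment_ms milliseconds."""
    size = int(sample_rate * segment_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# 400 made-up samples at 16,000 samples per second, cut into 10 ms pieces.
audio = [0.1, -0.5, 0.25, 0.0] * 100
segments = split_segments(normalize_peak(audio), sample_rate=16000, segment_ms=10)
print(len(segments), [len(s) for s in segments])
```

The last segment is shorter than the others because 400 samples do not divide evenly into 160-sample pieces; a real system would typically pad or drop it.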
Next the signal is divided into small segments as short as a few hundredths of a second,
or even thousandths in the case of plosive consonant sounds -- consonant stops
produced by obstructing airflow in the vocal tract -- like "p" or "t." The program then
matches these segments to known phonemes in the appropriate language. A phoneme
is the smallest element of a language -- a representation of the sounds we make and put
together to form meaningful expressions. There are roughly 40 phonemes in the English
language (different linguists have different opinions on the exact number), while other
languages have more or fewer phonemes. The next step seems simple, but it is actually
the most difficult to accomplish and is the focus of most speech recognition research.
The program examines phonemes in the context of the other phonemes around them. It
runs the contextual phoneme plot through a complex statistical model and compares
them to a large library of known words, phrases and sentences. The program then
determines what the user was probably saying and either outputs it as text or issues a
computer command.
This is the simplified version of a perplexing process. It takes less than a second for the
entire process to be complete. Before a computer can even understand what you mean,
it needs to be able to understand what you said.
This involves a complex process that includes audio sampling, feature extraction and
then actual speech recognition to recognize individual sounds and convert them to text.
Researchers have developed techniques that extract features in a similar way to the
human ear and recognise them as phonemes and sounds that human beings make as part
of their speech. This involves the use of artificial neural networks, hidden Markov
models and other ideas that are all part of the broad field of artificial intelligence.
Through these models, speech-recognition rates have improved. Error rates of less than
8% were reported by Google.
But even with these advancements, auditory recognition is only half the battle. Once a
computer has gone through this process, it only has the text that replicates what you
said. But you could have said anything at all.
The next step is natural language processing. Once a machine has converted what you
say into text, it then has to understand what you've actually said. This process is called
"natural language processing". This is arguably more difficult than the process of voice
recognition, because the human language is full of context and semantics that make the
process of natural language recognition difficult.
Anybody who has used earlier voice-recognition systems can testify as to how difficult
this can be. Early systems had a very limited vocabulary and you were required to say
commands in just the right way to ensure that the computer understood them.
This was true not only for voice-recognition systems, but even textual input systems,
where the order of the words and the inclusion of certain words made a large difference
to how the system processed the command. This was because early language-
processing systems used hard rules and decision trees to interpret commands, so any
deviation from these commands caused problems.
NLP is a way for computers to analyze, understand, and derive meaning from human
language in a smart and useful way. By utilizing NLP, developers can organize and
structure knowledge to perform tasks such as automatic summarization, translation,
named entity recognition, relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.
“Apart from common word processor operations that treat text like a mere sequence of
symbols, NLP considers the hierarchical structure of language: several words make a
phrase, several phrases make a sentence. By analyzing language for its meaning, NLP
systems have long filled useful roles, such as correcting grammar, converting speech to
text and automatically translating between languages.”
NLP is used to analyze text, allowing machines to understand how humans speak. This
human-computer interaction enables real-world applications like automatic text
summarization, sentiment analysis, topic extraction, named entity recognition, parts-
of-speech tagging, relationship extraction, stemming, and more. NLP is commonly
used for text mining, machine translation, and automated question answering.
Despite language being one of the easiest things for humans to learn, the ambiguity of language
is what makes natural language processing a difficult problem for computers to master.
NLP algorithms are typically based on machine learning algorithms. Instead of hand-
coding large sets of rules, NLP can rely on machine learning to automatically learn
these rules by analyzing a set of examples (i.e. a large corpus, like a book, down to a
collection of sentences), and making a statistical inference. In general, the more data
analyzed, the more accurate the model will be.
This is why it's possible to ask Siri either to "schedule a calendar appointment for 9am
to pick up my dry-cleaning" or "enter pick up my dry-cleaning in my calendar for 9am"
and get the same result.
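The idea of learning from examples instead of hand-coding rules can be illustrated with a tiny word-count (naive Bayes) classifier. This is a toy sketch and not how Siri works: the training phrases, the two intent labels and all function names are assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict

# Made-up training examples: (phrase, intent label).
TRAINING = [
    ("schedule a meeting for 3pm", "calendar"),
    ("add lunch with sam to my calendar", "calendar"),
    ("create a calendar appointment tomorrow", "calendar"),
    ("call the office", "call"),
    ("phone my wife", "call"),
    ("call sam on mobile", "call"),
]


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)  # add-one smoothing
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("schedule a calendar appointment for 9am to pick up my dry-cleaning", model))
print(classify("enter pick up my dry-cleaning in my calendar for 9am", model))
```

Both phrasings end up with the same label because the decision depends on which words occur, not on their exact order.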
In a hidden Markov model, each phoneme is like a link in a chain, and the completed chain is a word.
However, the chain branches off in different directions as the program attempts to
match the digital sound with the phoneme that's most likely to come next. During this
process, the program assigns a probability score to each phoneme, based on its built-in
dictionary and user training.
This process is even more complicated for phrases and sentences -- the system has to
figure out where each word stops and starts. The classic example is the phrase
"recognize speech," which sounds a lot like "wreck a nice beach" when you say it very
quickly. The program has to analyze the phonemes using the phrase that came before it
in order to get it right. Here's a breakdown of the two phrases:
r eh k ao g n ay z s p iy ch
"recognize speech"
r eh k ay n ay s b iy ch
"wreck a nice beach"
Speech recognition consists of two main modules: feature extraction and feature
matching. The purpose of the feature extraction module is to convert the speech
waveform into some type of representation for further analysis and processing; this
extracted information is known as the feature vector. The conversion of the voice signal
into feature vectors is done by the signal-processing front-end module.
Hidden Markov model (HMM): “hidden” refers to the fact that the internal state of the
Markov model is not visible to the outside world; the outside world can only see the
output value at each moment. In a speech recognition system, the output values are
usually the acoustic features calculated from the respective frames. Modelling a speech
signal with an HMM requires two assumptions: first, the transition of the internal state
depends only on the previous state; second, the output value depends only on the
current state (or the current state transition). These two assumptions greatly reduce the
model complexity.
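Under these two assumptions, the most likely sequence of hidden states can be found with the Viterbi algorithm. The sketch below is a toy example: the three phoneme states, the coarse acoustic labels ("hiss", "burst", "tone") and every probability are invented for illustration, and a real recognizer would work with log-probabilities over continuous feature vectors.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Transition depends only on the previous state,
            # emission depends only on the current state.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Hypothetical model: hidden phonemes, observed coarse sound classes.
states = ("s", "p", "iy")
start_p = {"s": 0.8, "p": 0.1, "iy": 0.1}
trans_p = {
    "s": {"s": 0.5, "p": 0.4, "iy": 0.1},
    "p": {"s": 0.1, "p": 0.3, "iy": 0.6},
    "iy": {"s": 0.1, "p": 0.1, "iy": 0.8},
}
emit_p = {
    "s": {"hiss": 0.8, "burst": 0.1, "tone": 0.1},
    "p": {"hiss": 0.1, "burst": 0.8, "tone": 0.1},
    "iy": {"hiss": 0.1, "burst": 0.1, "tone": 0.8},
}
frames = ["hiss", "hiss", "burst", "tone", "tone"]
print(viterbi(frames, states, start_p, trans_p, emit_p))
```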
Frame Blocking and Windowing: The speech signal is divided into a sequence of frames
where each frame can be analyzed independently and represented by a single feature
vector. Since each frame is supposed to have stationary behavior, a compromise, in
order to make the frame blocking, is to use a 20-25 ms window applied at 10 ms
intervals (frame rate of 100 frames/s and overlap between adjacent windows of about
50%), as Holmes & Holmes described in 2001. In order to reduce the discontinuities of
the speech signal at the edges of each frame, a tapered window is applied to each one.
The most commonly used window is the Hamming window.
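A minimal sketch of frame blocking and windowing is shown below, assuming a 16 kHz sample rate, a 25 ms window and a 10 ms step. The sample rate and the function names are assumptions for the example; the Hamming window uses its standard 0.54 and 0.46 coefficients.

```python
import math


def hamming(n_points):
    """Standard Hamming window of n_points samples."""
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
        for n in range(n_points)
    ]


def frame_signal(samples, sample_rate, window_ms=25, step_ms=10):
    """Split samples into overlapping frames and taper each with a Hamming window."""
    win = int(sample_rate * window_ms / 1000)   # samples per frame
    step = int(sample_rate * step_ms / 1000)    # samples between frame starts
    window = hamming(win)
    frames = []
    for start in range(0, len(samples) - win + 1, step):
        frames.append([samples[start + i] * window[i] for i in range(win)])
    return frames


# One second of a constant signal, just to show the frame count and the tapering.
frames = frame_signal([1.0] * 16000, sample_rate=16000)
print(len(frames), len(frames[0]), round(frames[0][0], 2))
```

With these settings each frame holds 400 samples and a new frame starts every 160 samples, so one second of audio yields roughly 100 frames.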
During speech recognition, when the system receives a signal containing voice, it
detects and locates the speech endpoints and removes excess noise before and after the
speech. The complete utterance is then passed to the next level of recognition.
Voice endpoint detection algorithms are mainly based on the energy of the voice,
zero-crossing rate, LPC coefficients, information entropy, cepstral features, band variance and so on.
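Two of these measures, short-time energy and zero-crossing rate, are simple enough to sketch directly. The example below is illustrative only: the threshold value and the function names are assumptions, and the endpoint decision uses energy alone, whereas practical detectors usually combine several measures.

```python
def short_time_energy(frame):
    """Average squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def find_endpoints(frames, energy_threshold):
    """Return (first, last) index of frames whose energy exceeds the threshold."""
    voiced = [i for i, f in enumerate(frames) if short_time_energy(f) > energy_threshold]
    if not voiced:
        return None  # no speech found
    return voiced[0], voiced[-1]


# Tiny made-up frames: silence, faint noise, two louder "speech" frames, silence.
example_frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.01, -0.01, 0.01, -0.01],
    [0.5, -0.4, 0.6, -0.5],
    [0.4, 0.3, -0.2, 0.1],
    [0.0, 0.0, 0.0, 0.0],
]
print(find_endpoints(example_frames, energy_threshold=0.01))
print(zero_crossing_rate(example_frames[2]))
```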
After the pre-emphasis and the frame blocking and windowing stage, the MFCC vectors
will be obtained from each speech frame.
A neural network is a system of hardware and/or software patterned after the operation
of neurons in the human brain. Neural networks -- also called artificial
neural networks -- are a variety of deep learning technology, which also falls under the
umbrella of artificial intelligence, or AI. Commercial applications of these technologies
generally focus on solving complex signal processing or pattern recognition problems.
Examples of significant commercial applications since 2000 include handwriting
recognition for check processing, speech-to-text transcription, oil-exploration data
analysis, weather prediction and facial recognition.
Each processing node has its own small sphere of knowledge, including what it has
seen and any rules it was originally programmed with or developed for itself. The tiers
are highly interconnected, which means each node in tier n will be connected to many
nodes in tier n-1 -- its inputs -- and in tier n+1, for whose nodes it provides input.
There may be one or multiple nodes in the output layer, from which the answer it
produces can be read.
Neural networks are notable for being adaptive, which means they modify themselves
as they learn from initial training and subsequent runs provide more information about
the world. The most basic learning model is centered on
weighting the input streams, which is how each node weights the importance of input
from each of its predecessors. Inputs that contribute to getting right answers are
weighted higher.
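The basic idea of weighting inputs according to how much they help can be sketched with the classic perceptron learning rule. This is one simple learning model chosen for illustration, not necessarily the one used in any particular speech system; the training data and function names are made up.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, rate=0.1):
    """Nudge each weight up or down whenever the prediction is wrong."""
    weights = [0.0] * len(examples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# The correct answer depends only on the first input; the second is a distraction.
examples = [([1, 0], 1), ([1, 1], 1), ([0, 1], 0), ([0, 0], 0)]
weights, bias = train_perceptron(examples)
print(weights, bias)
```

After training, the weight on the first input ends up larger than the weight on the second, which reflects the fact that only the first input helps produce the right answer.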
The layers are made of nodes. A node is just a place where computation happens,
loosely patterned on a neuron in the human brain, which fires when it encounters
sufficient stimuli. A node combines input from the data with a set of coefficients, or
weights, that either amplify or dampen that input, thereby assigning significance to
inputs with regard to the task the algorithm is trying to learn; e.g. which input is most
helpful in classifying data without error? These input-weight products are summed and
then the sum is passed through a node’s so-called activation function, to determine
whether and to what extent that signal should progress further through the network to
affect the ultimate outcome, say, an act of classification. If the signal passes through,
the neuron has been “activated.”
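A single node of this kind can be written in a few lines. The sketch assumes a sigmoid activation function, which is one common choice among several; the function names and the example numbers are illustrative.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias):
    """Sum the input-weight products, add the bias, then apply the activation."""
    z = bias + sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(z)


# Weighted sum: -1.0 + (1.0 * 3.0) + (2.0 * -1.0) = 0.0, so the output is 0.5.
print(node_output([1.0, 2.0], [3.0, -1.0], bias=-1.0))
```

An output close to 1 corresponds to a strongly "activated" node, while an output close to 0 means the signal is largely suppressed.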
A node layer is a row of those neuron-like switches that turn on or off as the input is
fed through the net. Each layer’s output is simultaneously the subsequent layer’s input,
starting from an initial input layer receiving your data.
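The way one layer's output becomes the next layer's input can be sketched as a simple loop. The layer sizes and weights below are arbitrary values chosen for the example, and the sigmoid activation is again an assumption.

```python
import math


def layer_forward(inputs, weights, biases):
    """weights[j] holds the input weights of node j in this layer."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = bias + sum(x * w for x, w in zip(inputs, node_weights))
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    return outputs


def network_forward(inputs, layers):
    """Feed the data through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# A made-up network: 3 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]], [0.0, 0.1]),
    ([[1.5, -2.0]], [0.05]),
]
print(network_forward([1.0, 0.5, -1.0], layers))
```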
Pairing the model’s adjustable weights with input features is how we assign
significance to those features with regard to how the neural network classifies and
clusters input.
Sound waves are one-dimensional. At every moment in time, they have a single value
based on the height of the wave. For example, let’s sample our “Hello” sound wave
16,000 times per second. Here are the first 100 samples:
Each number represents the amplitude of the sound wave at 1/16,000th-of-a-second intervals.
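Since the original recording is not available here, the sampling step can be illustrated with a synthetic tone instead of the “Hello” clip. The 440 Hz tone and the function name are assumptions for the example; only the 16,000 samples-per-second rate comes from the text.

```python
import math

SAMPLE_RATE = 16000  # samples per second


def sample_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return the amplitude of a sine wave measured every 1/sample_rate seconds."""
    count = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(count)]


samples = sample_tone(440, 1.0)   # one second of a 440 Hz tone
first_100 = samples[:100]         # the first 100 amplitude values
print(len(samples), [round(s, 3) for s in first_100[:5]])
```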
We now have an array of numbers with each number representing the sound wave’s
amplitude at 1/16,000th of a second intervals.
We could feed these numbers right into a neural network. But trying to recognize speech
patterns by processing these samples directly is difficult. Instead, we can make the
problem easier by doing some pre-processing on the audio data. Let’s start by grouping
our sampled audio into 20-millisecond-long chunks. Here’s our first 20 milliseconds of
audio (i.e., our first 320 samples):
This recording is only 1/50th of a second long. But even this short recording is a
complex mish-mash of different frequencies of sound. There are some low sounds, some
mid-range sounds, and even some high-pitched sounds sprinkled in. But taken all
together, these different frequencies mix together to make up the complex sound of
human speech.
To make this data easier for a neural network to process, we are going to break apart this
complex sound wave into its component parts. We’ll break out the low-pitched parts,
the next-lowest-pitched parts, and so on.
We do this using a Fourier transform, which breaks apart the complex sound wave into the simple sound
waves that make it up. Once we have those individual sound waves, we add up how
much energy is contained in each one.
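A minimal sketch of this step is shown below, using a plain discrete Fourier transform written out by hand. It assumes a 20 millisecond chunk sampled at 16,000 samples per second (320 samples), in which case neighbouring frequency bins are exactly 16,000 / 320 = 50 Hz apart. The test tone and function name are made up, and a real system would use a fast Fourier transform library rather than this slow loop.

```python
import cmath
import math

SAMPLE_RATE = 16000
CHUNK_SIZE = 320  # 20 ms at 16,000 samples per second


def dft_energies(chunk):
    """Energy in each frequency bin from 0 Hz up to half the sample rate."""
    n = len(chunk)
    energies = []
    for k in range(n // 2):
        coeff = sum(
            chunk[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        energies.append(abs(coeff) ** 2)
    return energies


# A synthetic 1,000 Hz tone stands in for a chunk of real speech.
chunk = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(CHUNK_SIZE)]
energies = dft_energies(chunk)
bin_width_hz = SAMPLE_RATE / CHUNK_SIZE
peak_bin = max(range(len(energies)), key=energies.__getitem__)
print(bin_width_hz, peak_bin * bin_width_hz)
```

For this pure tone almost all of the energy lands in a single bin; real speech spreads its energy over many bins at once.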
The end result is a score of how important each frequency range is, from low pitch (i.e.
bass notes) to high pitch. Each number below represents how much energy was in each
50 Hz band of our 20-millisecond audio clip:
Now that we have our audio in a format that’s easy to process, we will feed it into a deep
neural network. The input to the neural network will be 20 millisecond audio chunks.
For each little audio slice, it will try to figure out the letter that corresponds to the sound
currently being spoken.
We’ll use a recurrent neural network — that is, a neural network that has a memory that
influences future predictions. That’s because each letter it predicts should affect the
likelihood of the next letter it will predict too. For example, if we have said “HEL” so
far, it’s very likely we will say “LO” next to finish out the word “Hello”. It’s much less
likely that we will say something unpronounceable next like “XYZ”. So having that
memory of previous predictions helps the neural network make more accurate
predictions going forward.
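The "memory" of a recurrent network can be sketched with a single recurrent step: the new hidden state depends on both the current input and the previous hidden state. The weights below are arbitrary, untrained values and the function names are hypothetical; the example only shows that the same final input leads to a different state when the history differs.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, bias):
    """One recurrent step: combine the current input with the previous state."""
    h_new = []
    for j in range(len(h_prev)):
        z = bias[j]
        z += sum(w_xh[j][i] * x[i] for i in range(len(x)))            # current input
        z += sum(w_hh[j][k] * h_prev[k] for k in range(len(h_prev)))  # memory
        h_new.append(math.tanh(z))
    return h_new


def run_rnn(sequence, w_xh, w_hh, bias):
    """Process a sequence of inputs, carrying the hidden state along."""
    h = [0.0] * len(bias)
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, bias)
    return h


# Arbitrary weights (assumed, not trained).
w_xh = [[0.5, -0.3], [0.2, 0.8]]
w_hh = [[0.1, 0.4], [-0.2, 0.3]]
bias = [0.0, 0.0]

# Both sequences end with the same input but have different histories.
state_a = run_rnn([[1, 0], [0, 1]], w_xh, w_hh, bias)
state_b = run_rnn([[0, 1], [0, 1]], w_xh, w_hh, bias)
print(state_a, state_b)
```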
After we run our entire audio clip through the neural network (one chunk at a time),
we’ll end up with a mapping of each audio chunk to the letters most likely spoken during
that chunk. Here’s what that mapping looks like for me saying “Hello”:
Automatic speech recognition is just one example of speech recognition. Below are
other examples of voice recognition systems.
As voice recognition improves, it is being implemented in more places and it’s very
likely you have already used it. Below are some good examples of where you might
encounter voice recognition.
Automated phone systems - Many companies today use phone systems that
help direct the caller to the correct department. If you have ever been asked
something like "Say or press number 2 for support" and you say "two," you used
voice recognition.
Google Voice - Google Voice is a service that allows you to search and ask
questions on your computer, tablet, and phone.
Digital assistant - Amazon Echo, Apple's Siri, and Google Assistant use voice
recognition to let you interact with digital assistants that help answer questions.
Car Bluetooth - For cars with Bluetooth or Handsfree phone pairing, you can
use voice recognition to make commands, such as "call my wife" to make calls
without taking your eyes off the road.
Google Cloud Speech-to-Text enables developers to convert audio to text by applying
powerful neural network models in an easy-to-use API. The API recognizes 120
languages and variants to support your global user base. You can enable voice
command-and-control, transcribe audio from call centers, and more. It can process real-
time streaming or prerecorded audio, using Google’s machine learning technology.
Apply the most advanced deep-learning neural network algorithms to audio for speech
recognition with unparalleled accuracy. Cloud Speech-to-Text accuracy improves over
time as Google improves the internal speech recognition technology. Cloud
Speech-to-Text can support your global user base, recognizing 120 languages and variants.
You can also filter inappropriate content in text results for all languages. Cloud Speech-
to-Text can stream text results, immediately returning text as it’s recognized from
Speech recognition technology has only become more accessible to both consumers
and businesses. Here are some of the benefits of speech recognition software for both
consumer and business facing applications.
Consumer Application
Talking to Robots
Controlling Digital Devices
Enabling Hands-free Technology
Helpful for physically disabled people
Business Application
If businesses have speech processing technology built into their products,
their product can become relevant to a wider group of people. For instance,
many business apps currently require the ability for users to use their hands
to some degree to interact with the application, but this can prove difficult
for certain users with specific disabilities.
Voice recognition allows for ease of communication between people of
different languages. Speaking in a native language and having that
processed by speech recognition software and then translated either audibly
or visually opens up whole new opportunities for ease of communication.
Amazon recently disclosed that it is working on an in-house AI chip that could handle
more queries locally on the device itself instead of sending requests to servers in the
cloud. Then, people could potentially get even speedier responses to some of their
questions. Google is also working to come up with new smart speaker innovations.
It recently made Google Assistant — the voice-powered helper on devices like Android
smartphones and Google Home speakers — able to offer bilingual support, which
requires even more AI processing power from data centers, since the data involved
increases with each additional language.
Additionally, the report suggests smart TVs will show even more growth than smart
speakers. Before long, people may do away with remote controls altogether and give
their televisions voice commands.
ELSA will also introduce new social features to foster community-based learning and
build out more advanced AI capabilities to evaluate other speech elements like rhythm
and intonation for a more complete evaluation of learner's speech.
Steve Rabuchin, VP of Amazon Alexa says, “Our vision is that customers will be able
to access Alexa whenever and wherever they want. That means customers may be able
to talk to their cars, refrigerators, thermostats, lamps and all kinds of devices in and
outside their homes.”
According to Adobe Analytics, 71% of owners of smart speakers like Amazon Echo
and Google Home use voice assistants at least daily, and 44% use them multiple times
a day.
The mass adoption of artificial intelligence in users’ everyday lives is also fueling the
shift towards voice. The growing number of IoT devices such as smart thermostats and
speakers is giving voice assistants more utility in a connected user’s life. Smart speakers
are the number one way we are seeing voice being used; however, it only starts there. Many
industry experts even predict that nearly every application will integrate voice
technology in some way in the next 5 years.
Voice-enabled apps will start to understand not just what we are saying, but how we are
saying it and the context in which the inquiry is made.
However, there are still a number of barriers that need to be overcome before voice will
see mass adoption. Technological advances are making voice assistants more capable,
particularly in AI, natural language processing (NLP), and machine learning. To build
a robust speech recognition experience, the artificial intelligence behind it has to
become better at handling challenges such as accents and background noise. And as
consumers are becoming increasingly comfortable with and reliant upon using voice
to talk to their phones, cars, smart home devices, etc., voice will become a primary
interface to the digital world and with it, expertise for voice interface design and voice
app development will be in greater demand.
