Forensic Speech Recognition
INTRODUCTION
Forensic speaker identification is the application of science to the problem of identifying an
unknown speaker in a criminal investigation. A voice is much more than just a string of words.
Although DNA evidence grabs the headlines, the fact is that DNA can’t talk. It can’t be recorded
planning, carrying out or confessing to a crime [1]. The voice of a person can be used successfully
as a biometric feature because it is well accepted by users and can be recorded easily with
low-cost microphones and hardware [2]. It can provide an alternative, more secure means of
permitting entry without any need to remember a password or lock combination, and it removes the
reliance on keys, magnetic cards or other fallible devices that can easily be stolen. In the
present era, the wide availability of telephones, mobile phones and tape recorders has led to their
misuse, making them efficient tools in the commission of criminal offences such as kidnapping,
extortion, blackmail threats, obscene calls, anonymous calls, harassment calls, ransom calls,
terrorist calls and match fixing. The criminal sees the possibility of misusing these modes of
voice communication, believing that he will remain incognito and that nobody will recognize him.
Fortunately, this is no longer true. The voice can identify him and pin the crime on him [3].
Speaker identification is less complicated and leads to a more definite opinion when the
expert deals with normal, undisguised voice samples. The problem arises when disguised voice
samples, involving either accidental or attempted disguise, are submitted for identification.
Another aspect that makes speaker identification difficult is the case of similar-sounding
speakers who share the same sex, age and dialect.
SPEECH
Speech is the vocalized form of human communication [4]. Human beings express their
ideas, thoughts and feelings orally to one another through a series of complex movements that alter
and mold the basic tone created by the voice into specific, decodable sounds [5]. Speech
development is a gradual process that requires years of practice. Communication is a process, a
series of events allowing the speaker to express thoughts and emotions and the listener to
understand them. Speech communication begins as thought that is transformed into language for
expression [6].
The speech signal is a multidimensional acoustic wave [7] (as shown in fig 1), which conveys
information about the words or message being spoken, the identity of the speaker, the language
spoken, the presence and type of speech pathologies, and the physical and emotional state of the
speaker. A person’s speech also contains features that may reveal their geographical origin,
ethnicity or race, age, sex, education level, and religious orientation and background [8, 9, 10].
Often, humans are able to extract the identity information when the speech comes from a speaker
they are acquainted with.
Speech is a compelling biometric for several well-known reasons, particularly because it
is the only available modality in a large set of situations [11].
Through inhalation, air from the environment is drawn into the lungs, held there for a short
period of time and finally expelled under pressure through exhalation. During exhalation, air
under pressure is sent from the lungs to the larynx. The function of the larynx, particularly the
part known as the vocal folds, is to set the molecules of this breath stream into vibration [13]
(as shown in fig 2). For sound to be produced, these molecules
have to vibrate at a rate that falls within a particular range. The process by which molecules of air
are set into vibration is known as phonation.
SPEAKER RECOGNITION
Speaker recognition may be defined as any activity in which a speech sample is attributed to
a person on the basis of its acoustic or perceptual properties [15]. A spoken utterance carries
information about speaker characteristics, the spoken phrase, emotions, additional noise, channel
transformations, etc. [16]. Speaker recognition can be divided into speaker identification and
speaker verification. Speaker identification determines which registered speaker, from amongst a
set of known speakers, produced a given utterance. The unknown speaker is identified as the
speaker whose model best matches the input utterance. Speaker verification accepts or rejects the
identity claim of a speaker: is the speaker the person they say they are [17, 18, 19]? In speaker
recognition, the identification is made from the voice alone, not by analysing the language used,
by remembering what the speaker looks like or by any other means. The term is sometimes used when
it is not clear whether the process is one of verification or identification [20]. In a scheme for
the mechanical recognition of speakers, it is desirable to use acoustic parameters that are
closely related to the voice characteristics that distinguish speakers. This involves selecting
parameters that are motivated by known relations between the voice signal and vocal-tract shapes
and gestures [21].
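The difference between identification and verification can be illustrated with a minimal sketch. It assumes that each speaker model and each utterance has already been reduced to a short feature vector, and it uses cosine similarity and a fixed threshold of 0.9 purely for illustration; the names, vectors and threshold are hypothetical and are not taken from any particular system.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two feature vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(utterance, models):
    """Identification: pick the registered speaker whose model matches best."""
    return max(models, key=lambda name: cosine_similarity(utterance, models[name]))

def verify(utterance, claimed_model, threshold=0.9):
    """Verification: accept or reject a single identity claim."""
    return cosine_similarity(utterance, claimed_model) >= threshold

# Hypothetical speaker models and an unknown utterance (illustrative values only).
models = {"speaker_a": [1.0, 0.2, 0.1], "speaker_b": [0.1, 0.9, 0.4]}
unknown = [0.9, 0.3, 0.1]

print(identify(unknown, models))             # best-matching registered speaker
print(verify(unknown, models["speaker_b"]))  # identity claim checked against one model
```

Identification always returns one of the registered speakers, whereas verification can reject the claim, which is why the choice of threshold matters only in the second case.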
In speaker recognition we distinguish between low-level and high-level information. High-level
information includes dialect, accent, talking style, subject matter or context, phonetics, and
prosodic and lexical information [22]. These features are currently recognized and analyzed only
by humans. Low-level features include fundamental frequency (F0), formant frequency, pitch,
intensity, rhythm, tone, spectral magnitude and the bandwidths of an individual’s voice [23]. An
ideal feature would:
There are different ways to categorize the features. From the viewpoint of their physical
interpretation, we can divide them into:
1. Short-term spectral features - These features, as the name suggests, are computed from
short frames of about 20 to 30 milliseconds in duration. They are usually descriptors of
the resonance properties of the supralaryngeal vocal tract.
2. Voice source features - These features characterize the glottal excitation signal of voiced
sounds, such as glottal pulse shape and fundamental frequency, and it is reasonable to
assume that they carry speaker-specific information.
3. Spectro-temporal features - It is reasonable to assume that spectro-temporal signal
details such as formant transitions and energy modulations contain useful speaker-specific
information.
4. Prosodic features - Prosody refers to non-segmental aspects of speech, including syllable
stress, intonation patterns, speaking rate and rhythm. One important aspect of prosody is
that, unlike the traditional short-term spectral features, it spans long segments such as
syllables, words and utterances, and reflects differences in speaking style, language
background, sentence type and emotion of the speaker.
5. High-level features - These features attempt to capture conversation-level characteristics of
speakers, such as characteristic use of words (“uh-huh”, “you know”, “oh yeah”, etc.).
Other features are the dialect of the language used in the conversation by the speaker,
the accent of the speaker and the style of speaking.
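The short-term framing mentioned in the first category can be sketched as follows. The 25 ms frame length falls inside the 20 to 30 ms range given above; the 10 ms hop between frames and the mean-energy measure are assumptions made for illustration and are not prescribed by the text.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a signal into overlapping short frames.

    frame_ms follows the 20-30 ms range in the text; hop_ms is an assumed value.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

def frame_energy(frame):
    """Mean squared amplitude of one frame, a very simple short-term descriptor."""
    return sum(x * x for x in frame) / len(frame)

# One second of a constant-amplitude dummy signal sampled at 8 kHz.
signal = [0.5] * 8000
frames = split_into_frames(signal, sample_rate=8000)
print(len(frames), len(frames[0]), frame_energy(frames[0]))
```

Real short-term spectral features would replace `frame_energy` with a spectral analysis of each frame; the framing step itself stays the same.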
DISGUISED SPEECH
Any type of alteration, distortion or deviation from normal speech, irrespective of the cause, is
defined as speech disguise. Disguise can take many forms, and can be very damaging to both
lay and technical speaker identification [25]. The criminal often disguises his or her voice.
The effect of the disguise is that the acoustic features of the criminal exemplar are altered to
become less similar to the acoustic features of the criminal’s undisguised utterances. Research
has tended to be of two types. One type was non-electronic and attempted to measure the
ability of non-expert humans to identify other humans who were disguising their voices in a variety
of ways. The second type was electronic, often involving speech spectrograms, or so-called
“voiceprints” [26].
The question of voice disguise detection is fundamental in forensic applications. Different
kinds of approaches provide significant discrimination results. The results of complementary
formant-based and automatic analyses could be fused to increase the recognition rate.
The second challenge is that speaker identification is essentially incapable of accurately
determining the identity of a speaker when a test sample of his disguised speech is compared to a
reference based on his normal speaking mode. To date, and to the best of our knowledge, the above
statement remains true. One goal of forensic speaker recognition is to undertake research to reverse
that situation, at least for a large and useful subset of disguise types.
TYPES OF DISGUISE
Disguised speech can be of two types:
• Non-deliberate or accidental disguise- This form of voice disguise involves alterations that
result from some involuntary state of the individual. Cases of accidental disguise
involve a temporary change in a person’s speech due to a change in physical state, such as
chewing, eating or illness, or in emotional state, such as stress, anger, fear,
nervousness, cheerfulness, surprise or sadness. Research has been done on developing
robust and precise automatic speaker verification systems based on these speaker-based
variations in features [29].
• Deliberate or attempted disguise- Samples of attempted disguise are frequently
encountered in cases of anonymous calls, ransom calls and threatening calls, where the
speaker makes a deliberate effort to change their voice by changing its phonetic, phonemic
and prosodic features in order to hide their identity for fear of being caught.
Various methodologies have been proposed for approaching the problem of speaker identification.
For identification purposes, different well-recognised standard techniques will be used to
maintain the validity of the work done, with the choice made as per the requirement:
In the classical analogue spectrograph, a magnetic tape recorder and playback unit is used to
convert the sounds into electronic signals. These signals are then sent through a variable electronic
bandpass filter, which selects the frequency band to be analysed, before a stylus measures its
energy and records the results on electrically sensitive paper. The paper is mounted on a drum, which
rotates during playback in order to plot the time variations in the signal. When the whole length
of the speech sample has been analysed at a specific frequency band, the band of the filter and the
position of the stylus are altered correspondingly. The tape is then played again in order to
analyse a new part of the frequency spectrum. This process is repeated until the entire desired
frequency range has been analysed. In each spectrogram, the horizontal dimension is time, the
vertical dimension represents frequency and the darkness represents the intensity on the
compression scale [40]. The differences in amplitude values are shown in grey scale, where black
represents the most intense and white the least intense waveform components.
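A digital counterpart of this procedure can be sketched in a few lines. The sketch below is not the analogue method described above: it assumes a sampled signal, a Hann window and a plain discrete Fourier transform, with a frame length of 64 samples and a hop of 32 samples chosen only for illustration. Each column corresponds to one time step, each row to one frequency band, and the value in decibels plays the role of the darkness of the trace.

```python
import cmath
import math

def spectrogram(samples, frame_len=64, hop_len=32):
    """Return a list of columns (time); each column lists dB magnitudes (frequency)."""
    # Hann window to reduce leakage between neighbouring frequency bands.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        column = []
        for k in range(frame_len // 2 + 1):  # bins from 0 Hz up to half the sample rate
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            column.append(20 * math.log10(abs(coeff) + 1e-12))  # intensity in dB
        columns.append(column)
    return columns

# Test tone: 1000 Hz sampled at 8000 Hz, so each bin is 8000 / 64 = 125 Hz wide.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin * sample_rate / 64)  # frequency of the strongest band in the first column
```

The direct transform used here is slow but keeps the sketch free of external libraries; a practical implementation would use a fast Fourier transform.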
Since 1962, when it was first considered a foolproof method of personal identification, voice
identification by spectrographic analysis, the “voiceprint” technique, has been in legal limbo.
Recent developments in both science and the law, however, indicate that despite initially
adverse scientific and judicial reaction, spectrographic voice identification is perhaps coming of
legal age [41].
3) Computerized approach-
This is a semi-automatic approach to the recognition of speech samples which involves three stages:
• Feature Extraction
• Feature Comparison
• Classification
In this method the parameters of the signals are extracted by means of a spectrum analyzer, and
recognition is made by a computer system on the basis of stored data in respect of
controlled samples of the speakers.
However, it is observed that the error rates of machines are often more than an order of magnitude
greater than those of humans, as machine performance degrades below that of humans in noise,
with channel variability, and for spontaneous speech [42].
• BATVOX 3.0 accepts audio files in the following format: .wav files with linear PCM
coding, a sampling frequency of 8 kHz, 16-bit resolution and mono.
• It manages audio files of at least 7 seconds of net speech.
• It manages audio files whose signal-to-noise ratio is more than 10 dB.
• The test and the training audio files should contain the voices of speakers sharing the
same sex and the same language, and should have the same channel characteristics.
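The file-format requirements listed above can be checked before an audio file is submitted. The sketch below is a hypothetical pre-check written with Python's standard `wave` module; it is not part of BATVOX. The function names are invented for illustration, the total duration is used only as a rough stand-in for net speech (which would need speech detection), and the signal-to-noise ratio is not checked.

```python
import io
import wave

def check_audio_format(wav_source, min_seconds=7.0):
    """Return a list of format problems; an empty list means all checks passed."""
    problems = []
    with wave.open(wav_source, "rb") as wav:
        if wav.getcomptype() != "NONE":      # "NONE" means uncompressed linear PCM
            problems.append("not linear PCM")
        if wav.getframerate() != 8000:
            problems.append("sampling frequency is not 8 kHz")
        if wav.getsampwidth() != 2:          # 2 bytes per sample = 16 bits
            problems.append("resolution is not 16-bit")
        if wav.getnchannels() != 1:
            problems.append("not mono")
        # Total duration only; net speech may be shorter than this.
        duration = wav.getnframes() / wav.getframerate()
        if duration < min_seconds:
            problems.append("shorter than the minimum duration")
    return problems

def make_silent_wav(seconds, rate=8000, channels=1):
    """Build an in-memory 16-bit WAV file of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * channels * int(seconds * rate))
    buffer.seek(0)
    return buffer

print(check_audio_format(make_silent_wav(8)))
print(check_audio_format(make_silent_wav(3, rate=16000)))
```

`check_audio_format` also accepts a file path in place of the in-memory buffer used here.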
LIMITATIONS OF SPEAKER IDENTIFICATION
The criteria of identification of speech samples using different techniques are discussed as follows:
1. Auditory analysis- In this method, the identification is done on the basis of the following
voice characteristics:
o Quality of speech sample- Synthetic speech can be compared and evaluated with respect to
intelligibility, naturalness, and suitability for the intended application [62]. The aspects
considered include pronunciation, accent, speech sounds such as vowels and consonants, plosives,
fricatives, nasal and throat sounds and coupling effect, grammar, stress, syllable stress,
intonation, rhythm, fluency, pacing, phrasing and blending [63]. Each person possesses a unique
voice quality, which depends on a number of anatomical features such as the dimensions of the oral
tract, pharynx and nasal cavity, the shape and size of the tongue and lips, the position of the
teeth, tissue density, etc.
o Linguistic features- Linguistics is the scientific study of natural language. These features
involve the stylistic impression of speech, the delivery of speech (the style in which the speech
is delivered, i.e. manuscript, memorized, impromptu or extemporaneous [64]) and phonation (the
process by which the vocal folds produce certain sounds through quasi-periodic vibration, or
any oscillatory state of any part of the larynx that modifies the airstream, of which voicing is
one example [65]).
o Articulatory speech- This is speech produced by the movement or articulation of the
articulators. It involves the flow of speech (which depends upon the fluency of the speaker [66]),
plosive formation (first a complete closure of the passage of air at some point in the vocal tract,
then the removal of the closure, causing a sudden release of the blocked air with some
explosive noise) and nasality (nasal consonants have a continuous full closure at some point in
the oral cavity; since the velum is set in the low position, opening the velopharyngeal port, air
is let out through the nasal cavity).
o Prosodic analysis- It involves the intonation pattern, dynamics of loudness (dynamics refers to
the volume of a sound or note, and loudness is the strength of the sensation received through the
ear), speech rate (the relative timing of different speech events in spoken utterances), speech
variations, striking time features and pauses (number/length/pattern).
o Voice impairment- Speech or language impairment (SLI) means a communication disorder,
such as stuttering, impaired articulation, language impairment, or a voice impairment, that
adversely affects a person’s educational performance. Speech and language disorders refer to
problems in communication and related areas such as oral motor function. These delays and
disorders range from simple sound substitutions to the inability to understand or use language
or use the oral-motor mechanism for functional speech and feeding. Some causes of speech
and language disorders include hearing loss, neurological disorders, brain injury, mental
retardation, drug abuse, physical impairments such as cleft lip or palate, and vocal abuse or
misuse. Frequently, however, the cause is unknown.
o Temporal measurements- The temporal properties of speech play an important role in linguistic
contrast. Speech can be said to comprise three main temporal features based on
dominant fluctuation rates: envelope, periodicity and fine structure. Each feature has distinct
acoustic manifestations, auditory and perceptual correlates and roles in linguistic contrasts [67].
These measurements involve the phonation-time (P/T) ratio, speech time (S/T) rate and speech bursts
(their number/length/patterns).
2. Spectrographic analysis- The spectrograph is an instrument used to analyse the complex
waveforms of sound and their alterations in time. This is done through spectrograms, which
are graphic displays of amplitude as a function of both frequency and time [68]. In this
method, clue words are selected from the questioned and the specimen samples on the
basis of auditory analysis. These are then subjected to voice spectrographic analysis. A
trained examiner may be able to give an opinion about the similarity between the two
samples on the basis of characteristics like:
o Fundamental frequency- It is the frequency at which the vocal cords vibrate as they rapidly
open and close [69] (as shown in fig 5). The fundamental frequency of
a periodic signal is the inverse of the period length. The period, in turn, is the smallest
repeating unit of a signal [70]. In a voice spectrogram, the horizontal distance between vertical
striations is an indication of the fundamental frequency. It also includes the pitch of the voice,
i.e. the rate of vibration of the vocal cords.
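The relation between period and fundamental frequency stated above, F0 = 1/T, can be illustrated with a minimal sketch. It assumes a clean, strongly periodic signal and finds the period as the lag at which the signal best matches a shifted copy of itself (a plain autocorrelation search). The 60 to 400 Hz search range is an assumed value, and this is only one of several ways of estimating F0, not the spectrographic procedure itself.

```python
import math

def estimate_f0(samples, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate F0 as sample_rate / period, with the period found by autocorrelation."""
    min_lag = int(sample_rate / f0_max)   # shortest period considered, in samples
    max_lag = int(sample_rate / f0_min)   # longest period considered, in samples
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and itself shifted by `lag` samples.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag         # F0 is the inverse of the period length

# Synthetic 125 Hz tone sampled at 8 kHz: one period is 8000 / 125 = 64 samples.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 125 * n / sample_rate) for n in range(800)]
f0 = estimate_f0(tone, sample_rate)
print(f0)
```

On real speech the estimate is less clean, since voiced sounds are only quasi-periodic and unvoiced sounds have no fundamental frequency at all.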
3. Software, BATVOX 3.0- The working of this software depends upon the following
elements [43]:
o Case- The repository of audio files, models and calculations that are part of the same
investigation or forensic case.
o Audio file- The first element entered into the system in order to build the models and
compute the biometric calculations. The audio files in BATVOX can mainly be classified into
two types:
▪ Test audio: an unknown audio file that is compared to a suspect model in
order to find out whether both belong to the same speaker.
▪ Training audio: an audio file recorded from the known speaker, used to create
a voice model which can be compared with the test audio files.
o Model- A model generated from the audio files is the representation of the characteristics of
the speaker’s voice.
o Training of a model- A biometric process which extracts the characteristics of the voice from
the audio samples and thus generates a model.
o Session- A group of calculations gathered together because of some common aspects, according
to the criteria of the user. The calculations included in a session can be an identification and
an LR calculation.
o Identification- The objective of speaker identification is to classify a voice whose origin is
not known.
o Likelihood ratio (LR) – It is a ratio of two probabilities: the likelihood that the test audio
belongs to the suspect, and the likelihood that it does not belong to the suspect. One of the
differences between the LR and identification is the way the results are expressed.
o Normalization- It is the process of correcting the effects that the lack of alignment has on
statistical scoring. This lack of alignment is caused by the heterogeneous nature of the audio
system.
o Reference population- These samples are basically required for the calibration of the
instrument. For a proper selection of the reference population, the characteristics of the
population should match the characteristics of the disputed speaker. These characteristics
include the sex of the speaker, channel type, net spoken length and language [75].
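The likelihood ratio described in the list above can be illustrated with a minimal sketch. It assumes that a comparison produces a single similarity score, and that scores for same-speaker and different-speaker comparisons each follow a normal distribution with known mean and standard deviation. These distributions and all numeric values are hypothetical; this is not a description of how BATVOX computes its LR.

```python
import math

def gaussian_pdf(x, mean, std):
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood_ratio(score, same_mean, same_std, diff_mean, diff_std):
    """LR = p(score | test belongs to suspect) / p(score | test does not belong)."""
    return (gaussian_pdf(score, same_mean, same_std)
            / gaussian_pdf(score, diff_mean, diff_std))

# Hypothetical score distributions: same-speaker scores centre on 2.0,
# different-speaker scores (from a reference population) centre on 0.0.
for score in (0.0, 1.0, 2.0):
    print(score, likelihood_ratio(score, 2.0, 1.0, 0.0, 1.0))
```

An LR above 1 supports the hypothesis that the test audio belongs to the suspect, an LR below 1 supports the opposite hypothesis, and an LR of 1 favours neither. This differs from identification, which names the best-matching speaker.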
REFERENCES
1. Phil Rose & James R. Robertson, “Forensic Speaker Identification”, Taylor & Francis, 1999
2. Mohamed Chenafa et al., “Biometric System Based on Voice Recognition Using
Multiclassifiers”, Springer Berlin / Heidelberg, Volume 5372/2008
3. B.R. Sharma, “Scientific Criminal Investigation”, Universal Law Publishing Company
4. “Definitions of speech”, (en.wikipedia.org/wiki/Speech)
5. “National Institute on Deafness and Other Communication Disorders (NIDCD)”,
(www.nidcd.nih.gov/directory)
6. Dennis C. Tanner & Matthew E. Tann