An Analysis of Phase-Based Speech Features For Tonal Speech Recognition
Jyoti Mannala 1*, Bomken Kamdak 1, Utpal Bhattacharjee 1*
1Rajiv Gandhi University, Arunachal Pradesh, Rono Hills, Doimukh, 791112, India
Abstract
Automatic speech recognition (ASR) technologies and systems have made remarkable progress in the last decade. Nowadays, ASR-based systems have been successfully integrated into many commercial applications and are giving highly satisfactory results. However, speech recognition technologies as well as the systems are still highly dependent on the language family for which they are developed and optimized. This language dependency is a major hurdle in the development of a universal speech recognition system that can operate under any language condition. The language dependency basically comes from the parameterization of the speech signal itself. Tonal languages are a distinct category of languages in which pitch information distinguishes one morpheme from another. However, most feature extraction techniques for ASR are optimized for the English language, where tone-related information is completely suppressed. In this paper we have investigated short-time phase-based Modified Group Delay (MGD) features for parameterization of the speech signal for recognition of tonal vowels. The tonal vowels comprise two categories of vowels: vowels without any lexical tone and vowels with lexical tone. Therefore, a feature vector which can recognize the tonal vowels can be considered a speech parameterization technique for both tonal and non-tonal language recognizers.
Keywords: Feature analysis, MGD feature, Phase-based features, Speech recognition, Tonal language
Introduction
Natural languages are broadly classified into two categories, tonal and non-tonal, based on their dependency on lexical tone. In a tonal language, the lexical tone plays an important role in distinguishing otherwise similar syllables, whereas in a non-tonal language the lexical tone has no significant role in distinguishing syllables. English, Hindi and Assamese are examples of non-tonal languages, whereas Chinese, Japanese, and the languages of South East Asia, Sweden, Norway and Sub-Saharan Africa are tonal languages [1]. Modern speech recognition research has a half-century-long legacy. The technologies and systems developed for speech recognition have already registered significant progress, and many systems have been commercialized. However, those systems are optimized for non-tonal languages, particularly English. As a result, when these systems are used for tonal speech recognition their performance degrades considerably. Since a large section of the world's population speaks tonal languages, for the global acceptability of speech recognition technology and systems, they must be efficient in recognizing tonal as well as non-tonal languages.
One of the major reasons why systems developed for non-tonal languages fail to deliver consistent performance on tonal languages is the non-consideration of lexical-tone-related information. Lexical tones are produced as a result of excursions of the fundamental frequency, and this information is discarded in non-tonal speech recognition systems as a measure of performance optimization and due to robustness issues, as it contains very little useful information for a non-tonal speech recognition system.
In recent years many attempts have been made at developing tonal speech recognition systems [2–4]. Such systems are developed considering the fact that a tonal syllable has two components: a phonetic component and a tone. The phonetic component gives information about the base phonetic unit, which is similar to non-tonal speech, and the tonal unit gives information about the tone associated with that phonetic unit. As a result, tonal speech recognition systems rely on two sets of features: spectral features like MFCC for base phonetic unit recognition, and prosodic features for recognition of the associated lexical tone. The scores obtained from both are combined to arrive at a decision on the underlying syllabic unit. However, prosodic features are highly sensitive to ambient conditions. As a result, tonal speech recognition systems based on prosodic features are highly susceptible to ambient conditions.
The speech recognition system relies on short-term spectral properties of the speech signal in order to exploit its short-term stationarity. To extract these short-term properties, the Short-Time Fourier Transform (STFT) is used. The STFT returns the short-term magnitude and phase spectra of the speech signal. However, in most cases only the magnitude spectrum is retained, to extract spectral features like Mel Frequency Cepstral Coefficients (MFCC), and the phase spectrum is completely discarded due to the practical difficulty of phase wrapping [5, 6]. However, recent research has established the importance of the phase spectrum in speech processing applications like speech recognition, speaker recognition, emotion recognition and speech enhancement [7].
In this paper we have analyzed the tonal phoneme discrimination capability of phase-based features and evaluated their performance for tonal phoneme discrimination.
The Fourier transform of a discrete-time speech signal x(n) is given by
\[ X\left( \omega \right) = \left| {X\left( \omega \right)} \right|e^{j\phi \left( \omega \right)} \quad (1) \]
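As a concrete illustration, the polar decomposition of Eq. (1) into magnitude and phase spectra can be computed numerically; the sinusoid and sampling rate below are arbitrary toy values, not taken from the paper:

```python
import numpy as np

# Toy discrete-time signal: a 100 Hz sinusoid sampled at 1 kHz.
fs = 1000
n = np.arange(256)
x = np.sin(2 * np.pi * 100 * n / fs)

# DFT of the frame; each bin is |X(w)| * exp(j*phi(w)) as in Eq. (1).
X = np.fft.rfft(x)
magnitude = np.abs(X)   # |X(w)|, the part kept by features such as MFCC
phase = np.angle(X)     # phi(w), the (wrapped) phase spectrum

# The two spectra together reconstruct the complex spectrum exactly.
X_rebuilt = magnitude * np.exp(1j * phase)
print(np.allclose(X, X_rebuilt))  # True
```

This makes explicit that discarding the phase spectrum throws away exactly half of the information needed to reconstruct the frame.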
where \(\left| {X\left( \omega \right)} \right|\) is the magnitude spectrum and \(\phi \left( \omega \right)\) is the phase spectrum of the speech signal. There are a number of difficulties in using the phase spectrum directly in Automatic Speech Recognition (ASR). The two most critical problems are as follows. Firstly, the phase spectrum is highly sensitive to the exact positioning of the short-time analysis window: it has been observed that for a small shift in the analysis window, the phase spectrum changes dramatically [8]. Secondly, the phase spectrum values are only computable within the range \(\pm \pi\), called the principal phase spectrum; beyond this range the value changes abruptly due to the wrapping effect. For a better representation of the phase spectrum for automatic speech recognition, the spectrum must be unwrapped. The major problem with this unwrapping is that any multiple of \(2\pi\) can be added to the phase spectrum without changing the value of \(X\left( \omega \right)\).
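The wrapping effect and its repair can be reproduced in a few lines with `np.unwrap`; the delay value below is an arbitrary illustration:

```python
import numpy as np

# A pure delay of d samples has linear phase -d*w, which np.angle
# returns wrapped into the principal range (-pi, pi].
d = 5
w = np.linspace(0, np.pi, 128, endpoint=False)
true_phase = -d * w                          # smooth, unwrapped phase
wrapped = np.angle(np.exp(1j * true_phase))  # principal phase spectrum

# Wrapped values jump by ~2*pi; np.unwrap adds the missing multiples
# of 2*pi back, recovering the smooth phase curve.
unwrapped = np.unwrap(wrapped)
print(np.allclose(unwrapped, true_phase))  # True
```

Unwrapping succeeds here because the true phase changes by less than \(\pi\) between neighbouring bins; for real speech spectra this assumption often fails near spectral nulls, which is exactly the practical difficulty noted above.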
have shown that phase spectrum can be used for speech applications and gives promising results [9, 10]. Among the phase
based features extraction techniques, Group Delay Function (GDF) and All-pole Group Delay Function (APGD) are widely
used. In the present study we have used a modified version of GDF called Modified Group Delay (MGD) function for extracting
the phase based features due to their promising performance in speech recognition [11].
The Group Delay Function (GDF) is derived by taking the negative derivative of the Fourier phase spectrum \(\phi \left( \omega \right)\), written as [12, 13]:
\[ \tau \left( \omega \right) = - \frac{{d\phi \left( \omega \right)}}{{d\omega }} = \frac{{X_{R} \left( \omega \right)Y_{R} \left( \omega \right) + X_{I} \left( \omega \right)Y_{I} \left( \omega \right)}}{{\left| {X\left( \omega \right)} \right|^{2} }} \quad (2) \]
where the angular frequency \(\omega\) is limited to \((0, 2\pi)\), \(Y\left( \omega \right)\) is the Fourier transform of the time-weighted version of the speech signal given by \(y\left( n \right) = nx\left( n \right)\), and the subscripts R and I denote the real and imaginary parts of the signals. The features derived from the GDF often lead to an erroneous representation near points of discontinuity. This is due to the denominator \(\left| {X\left( \omega \right)} \right|^{2}\), which tends to 0 near the points of discontinuity. Therefore, the group delay function is modified as [14]
\[ \tau \left( \omega \right) = \frac{{\tau_{p} \left( \omega \right)}}{{\left| {\tau_{p} \left( \omega \right)} \right|}}\left| {\tau_{p} \left( \omega \right)} \right|^{\alpha } \quad (3) \]
where
\[ \tau_{p} \left( \omega \right) = \frac{{X_{R} \left( \omega \right)Y_{R} \left( \omega \right) + X_{I} \left( \omega \right)Y_{I} \left( \omega \right)}}{{\left| {S\left( \omega \right)} \right|^{2\gamma } }} \quad (4) \]
where \(S\left( \omega \right)\) is the cepstrally smoothed form of \(\left| {X\left( \omega \right)} \right|\), and \(\alpha\) and \(\gamma\) control the dynamic range of the modified group delay function. Here,
\[ P\left( \omega \right) = X_{R} \left( \omega \right)Y_{R} \left( \omega \right) + X_{I} \left( \omega \right)Y_{I} \left( \omega \right) \quad (5) \]
is called the product spectrum of the speech signal, which includes both magnitude and phase information [15].
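The MGD pipeline of Eqs. (2)–(5) can be sketched as follows. The values of \(\alpha\), \(\gamma\) and the cepstral lifter length are illustrative choices, not the settings used in this study:

```python
import numpy as np

def mgd(x, alpha=0.4, gamma=0.9, lifter=8):
    """Modified group delay of one frame, Eqs. (3)-(5); alpha, gamma and
    the lifter length are illustrative values, not the paper's settings."""
    n = np.arange(len(x))
    X = np.fft.fft(x)
    Y = np.fft.fft(n * x)               # spectrum of y(n) = n*x(n)
    # Product spectrum P(w), Eq. (5): carries magnitude and phase info.
    P = X.real * Y.real + X.imag * Y.imag
    # Cepstrally smoothed magnitude S(w): keep only low-quefrency cepstrum.
    log_mag = np.log(np.abs(X) + 1e-12)
    cep = np.fft.ifft(log_mag).real
    cep[lifter:-lifter] = 0.0           # lifter the high-quefrency part away
    S = np.exp(np.fft.fft(cep).real)
    tau_p = P / (S ** (2 * gamma) + 1e-12)          # Eq. (4)
    return np.sign(tau_p) * np.abs(tau_p) ** alpha  # Eq. (3)

frame = np.hamming(256) * np.random.default_rng(0).standard_normal(256)
feat = mgd(frame)
print(feat.shape)  # (256,)
```

The cepstral smoothing replaces the vanishing denominator \(\left| X\left( \omega \right) \right|^{2}\) of Eq. (2) with a spectrum that stays bounded away from zero, which is what suppresses the spikes near spectral discontinuities.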
Speech Database
In the present study, we have created a speech database of the Apatani language of Arunachal Pradesh in North East India to analyze the performance of phase-based features for tonal speech recognition in mismatched environmental conditions. The Apatani language belongs to the Tani group of languages; the Tani languages constitute a distinct subgroup within the Tibeto-Burman group of languages [16]. The Tani languages are found basically in the contiguous areas of Arunachal Pradesh. A small number of Tani speakers are found in the contiguous area of Tibet, and only the speakers of the Missing language are found in Assam [17]. The Apatani language has six vowels and seventeen consonants [18]. To record the database, 24 phonetically rich isolated tonal words have been selected. The words are spoken by 20 different speakers (13 males and 7 females). The recording has been done in a controlled acoustical environment at 16 kHz sampling frequency in 16-bit mono format. A headphone microphone has been used for recording the database. The words are selected in such a way that each tonal instance of a vowel has at least 5 instances among the words. Since the tone associated with the vowel is sufficient to identify the tone associated with the entire syllable [3, 19], in the present study we have evaluated the phoneme discrimination capability and robustness of the phase-based features with reference to their tonal vowel discrimination capability. Each tonal instance of a vowel has been considered a different tonal vowel. For example, the vowel [a:] has three associated tones: rising, falling and level. Thus the vowel [a:] gives rise to three tonal vowels: [a:] rising, [a:] falling and [\(\overline{a}\):] ([a:] level). Considering each tonal instance as a separate vowel, we get sixteen tonal vowels in the Apatani language. The vowels and their tonal instances are given in Table 1. Since the vowel [ə] has only one tone, it is not taken into consideration while evaluating the performance of the feature vectors.
All the experiments have been carried out using this database. The vowels have been segmented from the isolated words for all their tonal instances. The segmentation has been done using the PRAAT software, followed by subjective verification.
To evaluate the tonal phoneme discrimination capability of the features, both statistical methods and a Hidden Markov Model based recognizer have been used.
Euclidean distances between the feature values extracted from each pair of tonal phonemes have been computed. The Euclidean distance gives an indication of the linear separation among the tonal vowels with reference to the phase-based features. A higher Euclidean distance indicates better discrimination capability of the feature vector.
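The per-vowel average Euclidean distance described above can be computed as in the following sketch; the class names and mean vectors are invented toy data, not the paper's measurements:

```python
import numpy as np

# Mean feature vector per tonal vowel (toy 3-class, 4-dimensional data).
means = {
    "a_rising":  np.array([1.0, 0.2, 0.5, 0.1]),
    "a_falling": np.array([0.3, 1.1, 0.4, 0.9]),
    "a_level":   np.array([0.7, 0.6, 1.2, 0.3]),
}

# Average Euclidean distance of each tonal vowel from all the others;
# a larger average suggests better linear separation of that class.
avg_dist = {}
for v in means:
    others = [np.linalg.norm(means[v] - means[u]) for u in means if u != v]
    avg_dist[v] = float(np.mean(others))
print(avg_dist)
```

This is the computation that, applied to the measured MGD feature means, yields the figures reported in Table 2.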
Fisher's discrimination ratio (F-ratio) [20] has been used as a quantitative measure of the tonal phoneme discrimination capability of the features. The F-ratio is defined as:
\[ F = \frac{{{\text{Variance}}\,{\text{of}}\,{\text{the}}\,{\text{tonal}}\,{\text{phoneme}}\,{\text{means}}}}{{{\text{Average}}\,{\text{intra-phoneme}}\,{\text{variance}}\,{\text{for}}\,{\text{all}}\,{\text{phonemes}}}} \]
where \(\overline{\mu }\) is the average mean over all the tonal phonemes, \(\mu_{i}\) is the average mean for the base phoneme \(i\), \(\mu_{\beta ,i}\) is the average mean for phoneme \(i\) with tone \(\beta\), and \(x_{\beta }^{\left( i \right)}\) indicates an instance of phoneme \(i\) with tone \(\beta\). A higher F-ratio value indicates that the feature is capable of discriminating among the tonal phonemes.
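The F-ratio defined above can be sketched for a scalar feature as follows; the two toy data sets are invented to show that tight, well-separated groups score far higher than overlapping ones:

```python
import numpy as np

def f_ratio(groups):
    """F-ratio of a scalar feature: variance of the group means divided by
    the average within-group variance (higher = better discrimination)."""
    group_means = np.array([np.mean(g) for g in groups])
    between = np.var(group_means)
    within = np.mean([np.var(g) for g in groups])
    return between / within

# Invented feature values for three tonal phonemes.
tight = [[1.0, 1.1, 0.9], [3.0, 3.1, 2.9], [5.0, 5.1, 4.9]]
loose = [[1.0, 3.0, 5.0], [2.0, 4.0, 0.0], [1.5, 3.5, 5.5]]
print(f_ratio(tight) > f_ratio(loose))  # True
```

In the paper's setting each "group" is the set of feature values for one tonal phoneme, and the ratio is computed per feature dimension.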
To evaluate the performance of the phase-based feature set in recognizing the tonal phonemes, a Left-to-Right Hidden Markov Model (LRHMM) has been used. The LRHMM is suitable for speech recognition due to its capability to model the time-varying properties of the speech signal. The number of HMM states has been determined experimentally; in the present model, six states have been used. Each state is represented by a single Gaussian distribution function given by [21]:
\[ P\left( {x|\mu ,\sigma^{2} } \right) = \frac{1}{{\sqrt {2\pi \sigma^{2} } }}\exp \left( { - \frac{{\left( {x - \mu } \right)^{2} }}{{2\sigma^{2} }}} \right) \quad (8) \]
where x is the observation vector, \(\mu\) is the Gaussian mean vector and \(\sigma^{2}\) is the variance. The forward–backward algorithm has been used for training the HMM models. Clean speech signals have been used for training the models.
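A minimal sketch of the single-Gaussian state emission of Eq. (8), applied element-wise to an observation vector under a diagonal-covariance assumption (the paper does not state the covariance structure, so this is an illustrative choice):

```python
import numpy as np

def gaussian_log_likelihood(x, mu, var):
    """Log likelihood of observation vector x under a diagonal Gaussian
    with mean vector mu and per-dimension variance var (cf. Eq. 8)."""
    return float(np.sum(-0.5 * np.log(2 * np.pi * var)
                        - (x - mu) ** 2 / (2 * var)))

# An observation scores highest under the state whose mean it matches.
mu_a, mu_b = np.array([0.0, 0.0]), np.array([3.0, 3.0])
var = np.array([1.0, 1.0])
x = np.array([0.1, -0.2])
print(gaussian_log_likelihood(x, mu_a, var) >
      gaussian_log_likelihood(x, mu_b, var))  # True
```

During decoding, these per-state log likelihoods are combined with the transition probabilities of the left-to-right HMM to score each tonal vowel model.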
To extract the short-time MGD features, the speech signal is first pre-emphasized with an emphasis factor of 0.97 and then framed using a Hamming window of 30 ms duration with a 10 ms frame shift. The phase-based MGD features are extracted from the windowed speech signal using the method described in Sect. 2.
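The pre-emphasis and framing step can be sketched directly from the stated parameters (0.97 emphasis factor, 30 ms Hamming window, 10 ms shift, 16 kHz audio); the random signal below merely stands in for real speech:

```python
import numpy as np

fs = 16000                   # 16 kHz sampling, as in the database
frame_len = int(0.030 * fs)  # 30 ms analysis window -> 480 samples
hop = int(0.010 * fs)        # 10 ms frame shift     -> 160 samples

rng = np.random.default_rng(0)
speech = rng.standard_normal(fs)  # 1 s stand-in for a speech signal

# Pre-emphasis: s'(n) = s(n) - 0.97*s(n-1) boosts high frequencies.
emph = np.append(speech[0], speech[1:] - 0.97 * speech[:-1])

# Overlapping Hamming-windowed frames, ready for MGD extraction.
window = np.hamming(frame_len)
n_frames = 1 + (len(emph) - frame_len) // hop
frames = np.stack([emph[i * hop : i * hop + frame_len] * window
                   for i in range(n_frames)])
print(frames.shape)  # (98, 480)
```

Each row of `frames` would then be passed through the MGD computation of Sect. 2 to yield one feature vector per 10 ms.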
In the first set of experiments we have evaluated the phoneme discrimination capability of the MGD features in the context of tonal vowel recognition. The feature values have been computed from each instance of the tonal vowels. For each tonal vowel, the average value of each dimension of the feature vector has been computed, ignoring the outliers. The Euclidean distances have been computed between each tonal vowel and all the other tonal vowels, and their average has been taken. Table 2 gives the average Euclidean distance of each tonal vowel from all the other tonal vowels (Table 3).
From the above experiments it has been observed that phase-based MGD features are suitable for discriminating the tonal vowels. They possess discrimination ability even when the base phoneme of the tonal vowels is the same and the distinction among them is due to the underlying tone only, or vice versa.
To assess the suitability of the MGD features for tonal vowel recognition, we have computed the F-ratio values of the features. A higher F-ratio value among different groups indicates better discrimination ability of the feature with respect to that grouping factor. In the present study we have computed the F-ratio value with the grouping factors same base-phoneme, same tone, and different base-phoneme and tone. The F-ratio values are listed in Table 4.
From the above experiments, it has been established that the short-time phase-based MGD feature has the capability to identify tonal vowels even when they are distinct from each other only by tone or only by base-phoneme. This observation asserts that the short-time phase-based MGD feature is a better alternative than the combination of MFCC and prosodic features for tonal vowel recognition, which we evaluated in our earlier works [22].
In the next set of experiments, we have evaluated the performance of the MGD features for tonal vowel recognition in terms of the recognition accuracy of the HMM based recognizer. The model has been trained using the clean speech database: 60% of the tonal instances of each vowel have been used for training and the remaining 40% for testing the system. The performance of the MGD features has been evaluated in terms of recognition accuracy, i.e. the percentage of times the recognizer has been able to recognize the tonal vowel correctly. The error cases have been further investigated in depth to get an insight into the confusion created at the modeling level. Table 5 presents an analysis of the performance of HMM based tonal vowel recognition.
From the experiments it has been observed that the short-term phase-based MGD feature vector is efficient in representing both tone variation and base-phoneme variation in the case of tonal vowels. In only 6.46% of cases has the recognizer been unable to recognize the tone variation of the same base-phone, whereas in 2.91% of cases tone dominates over the base-phone in tonal vowel recognition. These facts reaffirm the suitability of the MGD feature for tonal vowel recognition in particular and language recognition in general.
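The error breakdown described above (tone confusions versus base-phone confusions) can be computed from reference/hypothesis label pairs as in this sketch; the four pairs and their counts are invented for illustration, not the paper's results:

```python
# Each label is (base_phone, tone); pairs are (reference, hypothesis).
pairs = [
    (("a", "rising"),  ("a", "rising")),   # correct
    (("a", "falling"), ("a", "rising")),   # tone error, same base phone
    (("e", "level"),   ("o", "level")),    # base-phone error, same tone
    (("o", "rising"),  ("o", "rising")),   # correct
]
correct = sum(r == h for r, h in pairs)
tone_err = sum(r != h and r[0] == h[0] for r, h in pairs)  # wrong tone only
base_err = sum(r != h and r[1] == h[1] for r, h in pairs)  # wrong base only
print(correct / len(pairs), tone_err, base_err)  # 0.5 1 1
```

Applied to the full test set, this partition yields the accuracy and the 6.46% / 2.91% confusion figures analyzed in Table 5.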
Conclusion
In this paper we have investigated the performance of MGD features for their tonal vowel discrimination capability. It has been observed that the phase-based MGD features extracted from different tonal vowels are statistically separated from each other in the feature space even when the vowels differ from each other only by tone or by base-phone. This fact has been established by the statistical measures Euclidean distance and the F-ratio test. The performance of the features has been evaluated with an HMM based recognizer in terms of recognition accuracy: in 89.23% of cases, the tonal vowels are recognized correctly by the HMM based recognizer trained and tested with MGD features. In the present investigation, it has been observed that MGD features are equally efficient in representing vowels with lexical tone (rising and falling) and vowels without any lexical tone (level tone). This observation calls for a more in-depth investigation of the MGD feature as a parameterization technique for a language-independent ASR system.
References
1. U. Bhattacharjee, Recognition of the tonal words of Bodo language. Int. J. Recent Technol. Eng. 1 (2013)
2. H.M. Wang, J.L. Shen, Y.J. Yang, C.Y. Tseng, S.L. Lee, Complete Chinese dictation system research and development, in Proceedings ICASSP-94, vol. 1 (1994), pp. 59–61
3. C.J. Chen, H. Li, L. Shen, G.K. Fu, Recognize tone languages using pitch information on the main vowel of each syllable, in Proceedings ICASSP'01, vol. 1 (IEEE, 2001)
4. C.J. Chen, R.A. Gopinath, M.D. Monkowski, M.A. Picheny, K. Shen, New methods in continuous Mandarin speech recognition, in 5th European Conference on Speech Communication and Technology, vol. 3 (1997), pp. 1543–1546
5. P. Mowlaee, R. Saeidi, Y. Stylianou, Phase importance in speech processing applications, in Fifteenth Annual Conference of the International Speech Communication Association (2014)
6. B. Yegnanarayana, J. Sreekanth, A. Rangarajan, Waveform estimation using group delay processing. IEEE Trans. Acoust. Speech Signal Process. 33(4), 832–836 (1985)
7. J. Deng, X. Xu, Z. Zhang, S. Frühholz, B. Schuller, Exploitation of phase-based features for whispered speech emotion recognition. IEEE Access 4, 4299–4309 (2016)
8. L.D. Alsteris, K.K. Paliwal, Short-time phase spectrum in speech processing: a review and some experimental results. Digital Signal Process. 17(3), 578–616 (2007)
9. R.M. Hegde, H.A. Murthy, V.R.R. Gadde, Significance of the modified group delay feature in speech recognition. IEEE Trans. Audio Speech Lang. Process. 15(1), 190–202 (2007)
10. I. Hernáez, I. Saratxaga, J. Sanchez, E. Navas, I. Luengo, Use of the harmonic phase in speaker recognition, in Twelfth Annual Conference of the International Speech Communication Association (2011)
11. B. Bozkurt, L. Couvreur, On the use of phase information for speech recognition, in 2005 13th European Signal Processing Conference (IEEE, 2005), pp. 1–4
12. H. Banno, J. Lu, S. Nakamura, K. Shikano, H. Kawahara, Efficient representation of short-time phase based on group delay, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP'98 (1998)
13. H.A. Murthy, B. Yegnanarayana, Speech processing using group delay functions. Signal Process. 22(3), 259–267 (1991)
14. H. Murthy, V. Gadde, The modified group delay function and its application to phoneme recognition, in Proceedings ICASSP (Hong Kong, 2003), pp. 68–71
15. D. Zhu, K.K. Paliwal, Product of power spectrum and group delay function for speech recognition, in Proceedings ICASSP'04 (2004), pp. 125–128
16. M.W. Post, T. Kanno, Apatani phonology and lexicon, with a special focus on tone. Himalayan Linguist. 12(1), 17–75 (2013)
17. J.T. Sun, Tani languages, in The Sino-Tibetan Languages, ed. by G. Thurgood, R. LaPolla (Routledge, London and New York, 2003), pp. 456–466
18. P.T. Abraham, Apatani-English-Hindi Dictionary (Central Institute of Indian Languages, Mysore, 1987)
19. U. Bhattacharjee, J. Mannala, An experimental analysis of speech features for tone speech recognition. Int. J. Innov. Technol. Exploring Eng. 9(2), 4355–4360 (2019)
20. H. Patro, G.S. Raja, S. Dandapat, Statistical feature evaluation for classification of stressed speech. Int. J. Speech Technol. 10(2–3), 143–152 (2007)
21. L. Rabiner et al., HMM clustering for connected word recognition, in International Conference on Acoustics, Speech, and Signal Processing (IEEE, 1989)
22. U. Bhattacharjee, J. Mannala, Feature level solution to noise robust speech recognition in the context of tonal languages. Int. J. Eng. Adv. Technol. 9(2), 3864–3870 (2019)
Table 1 The vowels of the Apatani language (including ʊ, ɑ:, ɛ, ɔ and ə) and their tonal instances; the vowel [ə] carries a single tone
Table 2 Average Euclidean distances of each tonal vowel from all the other tonal vowels