Spectral and Textural Features for Automatic

Classification of Fricatives Using SVM

Alex Frid 1,2, Yizhar Lavner 1

1 Department of Computer Science, Tel-Hai College, Israel
2 Edmond J. Safra Brain Research Center for the Study of Learning Disabilities, University of Haifa, Israel
alexfrid@gmail.com

Abstract - We report on an analysis of spectral and textural characteristics of fricatives for their classification. Fricative classification can be useful in applications such as differential manipulation of phonemes for the hearing impaired, where people have difficulties in the perception of fricatives. Several acoustic time- and frequency-domain features were computed and examined for constructing discriminative feature vectors, which enable accurate classification of fricatives for various speakers and dialects, and for varied contexts. The best sets of features for classification were selected using a floating-search procedure. The evaluation included a data set of more than 18,000 fricatives, from more than 100 speakers. The classification stage included training a support vector machine (SVM) on a small part of the data, initial classification of each signal frame (8-12 msec), and a majority vote over the feature vectors of the same phoneme. An overall accuracy of 89% and 80% was obtained for the unvoiced and voiced fricatives, respectively, and 97% for sibilant/non-sibilant discrimination.

Keywords - Fricative classification, Support Vector Machine (SVM)
I. INTRODUCTION

Classification of phonemes is the process of assigning a phonetic category to a short section of speech signal. It is a key stage in various applications such as spoken term detection, continuous speech recognition and music-to-lyrics synchronization, to name a few.

In this study, we focus on the analysis of discriminative features for classification of the fricative phonemes. This classification should be applicable in the professional music industry [1], or in developing technologies for the hearing impaired [2]. In the music industry, it may be used in applications where pre-defined phonemes should be selectively processed, as in de-essing [3]. In technologies for the hearing impaired, where people have difficulties in the perception of the high frequency range and may not discriminate between fricatives, these phonemes can be selectively modified, and thus their perception could be improved [4]. In both applications fricative classification may be used as the second stage of a three-stage system, where in the first stage fricative spotting is applied to detect all instances of fricatives within a given utterance [5]. After the classification stage, a differential manipulation may be applied to each phoneme.

Fricatives are characterized by a relatively large amount of spectral energy in the high frequency range. The classification of individual phonemes within this group is fairly difficult due to their similar acoustic-phonetic characteristics, which stem from their production process [6].

Earlier studies have investigated acoustic features of fricatives for various applications of classification/detection [7]-[12]. In [7], a detailed analysis of fricative features is presented in an automatic classification framework. A biologically oriented filter-bank was used as a front-end for classification of different aspects of fricatives. Place of articulation was obtained by using five spectral-based features, and voicing discrimination was based on the duration of the unvoiced portion of the fricative with respect to that of the preceding vowel. In both discriminations tree-based decision rules were applied, with pre-defined thresholds. In [12], an algorithm based on spectral moments for classification of unvoiced fricatives is proposed, in the context of lip synchronization for facial animation. In [2] both spotting and classification of some fricatives were performed in the context of an application for the hearing impaired. Pre-defined decision rules and a maximum-likelihood procedure based on spectral energy ratios were used for the classification.

In this study we analyzed and examined a large set of spectral and textural characteristics of the fricative phonemes for their discrimination. Our aim has been to construct discriminative feature vectors, in order to obtain a high classification rate for different speakers with various dialects and accents, and for different tasks. Special effort was made to discriminate between the two pairs of non-sibilants (/f/ and /th/, and /v/ and /dh/), which were handled as one group in former studies [7]. After constructing the feature vectors, we trained a support vector machine (SVM) and evaluated its performance on more than 18,000 fricatives from the TIMIT database. As opposed to former studies, we did not employ any manual exclusion, and used the information from the fricatives only, without cues from adjacent phonemes.

II. METHODS

A. Speech material

The speech material for training and evaluation of the classification procedures includes voiced and unvoiced fricatives derived from the TIMIT speech database [13], using 8 dialect regions. For accessing and filtering the required fricatives we used MatlabADT [14].

IWSSIP 2014, 21st International Conference on Systems, Signals and Image Processing, 12-15 May 2014, Dubrovnik, Croatia
B. Pre-processing

After extraction of the isolated phoneme signals, each phoneme segment was divided into consecutive analysis frames of 8 milliseconds, with an overlap of 50%. For frequency domain analysis each frame was multiplied by a Hamming window and pre-emphasized. For each frame a vector of characteristic features was then constructed for analysis and classification.
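For illustration, this pre-processing chain can be sketched in Python/NumPy as follows; the 16 kHz TIMIT sampling rate and the pre-emphasis coefficient of 0.97 are our assumptions, since the paper does not state them:

    import numpy as np

    def preprocess(signal, fs=16000, frame_ms=8.0, overlap=0.5, alpha=0.97):
        """Split a phoneme segment into overlapping, pre-emphasized, Hamming-windowed frames."""
        # Pre-emphasis y[n] = x[n] - alpha*x[n-1] (coefficient is an assumed, typical value)
        emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
        frame_len = int(fs * frame_ms / 1000)        # 8 ms -> 128 samples at 16 kHz
        hop = int(frame_len * (1 - overlap))         # 50% overlap -> 4 ms hop
        window = np.hamming(frame_len)
        frames = [emphasized[i:i + frame_len] * window
                  for i in range(0, len(emphasized) - frame_len + 1, hop)]
        return np.array(frames)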
C. Feature Analysis and Extraction

To obtain discriminative features for fricative classification we extracted and analyzed multiple features that may characterize the individual phonemes. These can be divided into spectral features and textural features. The spectral features aim at capturing different aspects of the fricative spectra, and were derived from spectral representations of the fricatives such as the Discrete Fourier Transform (DFT), Mel-Frequency Cepstrum (MFC), or the spectral envelope of the linear prediction analysis. The textural features include characteristics such as energy, zero-crossing statistics and lacunarity (see below). In preliminary experiments, we used 1600 fricatives for estimation of the classification errors of a single-feature discriminator for each feature. In addition, correct classification rates for various combinations of features were computed. The most discriminative features were selected using a floating-search procedure [15] for further evaluation, and are described as follows:

1) Spectral features
• Spectral Peak Locations [6], [7]: These are the frequency locations of the peaks of the spectral envelope derived from linear prediction. Sibilants such as /s/ and /sh/ are characterized by distinctive spectral peaks, where the frequency of the lower spectral peak of /s/ is higher than that of /sh/ (4 kHz and 2.5 kHz, respectively, [7]). The spectrum of non-sibilants (/f/ and /th/) is flatter. The frequencies and amplitudes of the first two highest peaks were detected and examined.

• Spectral Rolloff: The spectral rolloff point is defined as the boundary frequency f_r below which p percent of the magnitude distribution is concentrated:

    Σ_{k=0}^{f_r} M_t(k) = p · Σ_{k=0}^{K−1} M_t(k)    (1)

where M_t(k) is the magnitude of the Fourier transform at frame t and frequency bin k, and the K-th bin corresponds to the Nyquist frequency. Using this feature is justified by the observation that sibilants, both voiced and unvoiced, have more spectral energy in the high frequency range [16]. Therefore, the spectral rolloff point of sibilants is expected to be lower than that of non-sibilants. For example, as can be seen in Figure 1, the pdfs of /sh/ and /th/ are different and may help to discriminate between these phonemes. We used values of 18%, 25%, 50% and 75%.
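A minimal NumPy sketch of Eq. (1), returning the rolloff frequency of a single windowed frame (the function name and the sampling rate are our assumptions):

    import numpy as np

    def spectral_rolloff(frame, fs=16000, p=0.5):
        """Frequency below which a fraction p of the magnitude spectrum lies (Eq. 1)."""
        mag = np.abs(np.fft.rfft(frame))              # M_t(k), k = 0 .. Nyquist bin
        cumulative = np.cumsum(mag)
        k_roll = np.searchsorted(cumulative, p * cumulative[-1])
        return k_roll * fs / (2 * (len(mag) - 1))     # bin index -> Hz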

• Spectral Centroid: The spectral centroid is defined as the center of gravity of the magnitude spectrum of the Short-Time Fourier Transform (STFT):

    S_t = Σ_k M_t(k) · k / Σ_k M_t(k)    (2)

where M_t(k) is the magnitude of the Fourier transform as in Eq. (1). Sibilants, both voiced and unvoiced, are expected to have a higher value of spectral centroid compared to non-sibilants.

• Band Energy Ratio: The band energy ratio is defined as the ratio of the spectral energies of two frequency bands, E_B1 and E_B2:

    E_t = 10 · log10(E_B1 / E_B2)    (3)

where B1 = 4-8 kHz and B2 = 2-4 kHz.

• The spectral energy below 1 kHz, denoted as LE:

    LE = Σ_{k=0}^{K_f} (M_t(k))²    (4)

where K_f is the bin corresponding to 1 kHz.
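Eqs. (2)-(4) can be rendered in the same style (a sketch under the band definitions above; computing the centroid in Hz rather than in bin index is our choice and only rescales the feature):

    import numpy as np

    def spectral_features(frame, fs=16000):
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
        # Spectral centroid (Eq. 2): magnitude-weighted mean frequency
        centroid = np.sum(freqs * mag) / np.sum(mag)
        # Band energy ratio (Eq. 3): B1 = 4-8 kHz over B2 = 2-4 kHz, in dB
        e_b1 = np.sum(mag[(freqs >= 4000) & (freqs < 8000)] ** 2)
        e_b2 = np.sum(mag[(freqs >= 2000) & (freqs < 4000)] ** 2)
        band_ratio = 10 * np.log10(e_b1 / e_b2)
        # Low-band energy (Eq. 4): spectral energy below 1 kHz
        low_energy = np.sum(mag[freqs <= 1000] ** 2)
        return centroid, band_ratio, low_energy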
• Mel Frequency Cepstral Coefficients (MFCCs): Only the first three coefficients (out of 26) were found to be discriminative.

• Gammatone Maximal Peak Frequency (GMPF), as described in [17].

2) Textural features

• Zero-Crossing Rate (ZCR): The ZCR is defined as the number of times the audio waveform changes its sign along the frame:

    ZCR = (1/2) · Σ_{n=1}^{N−1} |sgn(x[n]) − sgn(x[n−1])|    (6)

where x(n) is the time domain signal of frame t.

• Standard Deviation, Skewness and Kurtosis of Zero-Crossing intervals: These features were computed using the statistics of the time intervals between consecutive zero crossings.
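A possible rendering of Eq. (6) together with the zero-crossing interval statistics (using scipy.stats for skewness and kurtosis is our choice):

    import numpy as np
    from scipy.stats import kurtosis, skew

    def zcr_features(frame):
        """ZCR (Eq. 6) and statistics of the intervals between consecutive zero crossings."""
        sign_changes = np.abs(np.diff(np.sign(frame)))
        zcr = 0.5 * np.sum(sign_changes)               # Eq. (6): counts the sign changes
        crossings = np.nonzero(sign_changes)[0]        # sample indices of the crossings
        intervals = np.diff(crossings)                 # time intervals, in samples
        return zcr, np.std(intervals), skew(intervals), kurtosis(intervals)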
• Lacunarity β parameter [18]: Lacunarity [19] is a textural measure which examines the deviation from translational invariance of the signal, that is, how short sections of the signal differ from one another. It is computed as follows:

1. Applying a sliding window of length r over the frame, compute the total "mass" S(r) of the signal inside the window, where the "mass" is the total number of ones in the binary signal derived from the audio signal:

    y_binary(n) = 1 if x(n) > 0, 0 otherwise    (9)

2. Compute the mean and variance of S(r), and the lacunarity measure, defined as:

    L(r) = (Var[S(r)] / m²[S(r)]) + 1    (10)

3. Applying a least-squares approximation to the points of L(r), obtain the parameters of the best-fitting function of the form α·x^β + γ.
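Steps 1-3 might look as follows in NumPy; fitting on a log-log scale is a simple stand-in for the paper's least-squares fit of α·x^β + γ, so the β returned here is only an approximation of the paper's parameter:

    import numpy as np

    def lacunarity(frame, r):
        """Gliding-box lacunarity L(r) of the binarized frame (Eqs. 9-10)."""
        binary = (frame > 0).astype(int)                     # Eq. (9)
        # "Mass" S(r): number of ones inside each position of the sliding window
        masses = np.convolve(binary, np.ones(r, dtype=int), mode='valid')
        return np.var(masses) / np.mean(masses) ** 2 + 1     # Eq. (10)

    def lacunarity_beta(frame, r_values=(2, 4, 8, 16, 32)):
        """Fit L(r) over several window lengths and return the slope beta (step 3)."""
        L = np.array([lacunarity(frame, r) for r in r_values])
        beta, _ = np.polyfit(np.log(r_values), np.log(L), 1)
        return beta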

[Figure 1: Estimations of the probability density functions (pdfs) of the spectral rolloff (p = 0.5, left, where the x-axis represents the rolloff frequency), and of the lacunarity β parameter (right).]
D. Processing and Classification Procedure

We used a non-linear support vector machine as the classifier. In a non-linear SVM, the feature vectors are mapped into a high-dimensional space, where the classes are expected to be linearly separable. We used LIBSVM [20], which is an implementation of the algorithm presented in [21], for the different classification tasks, with a radial basis kernel function. The data for the SVM training process consisted of several hundred phonemes of each type, which were derived from eight different English dialects from TIMIT. Feature vectors were produced, one for each frame. All the feature vectors were automatically scaled to zero mean and unit variance, and the free parameters were selected using a grid search with polynomial scale and a cross-validation procedure. A limited set of phonemes, several hundred of each type, was used for feature selection and for the training of the classifier.

The evaluation of the trained SVM was carried out on the whole set of fricatives from TIMIT (excluding the training set) for the following classification tasks: 1. Discrimination between voiced and unvoiced fricatives; 2. Classification of unvoiced fricatives; 3. Classification of voiced fricatives. We also examined the discrimination between sibilants and non-sibilants. After applying the classification scheme to each frame, a majority vote decision is applied to all frames included in the same phoneme segment, to obtain a final classification.
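As an illustration of this two-stage decision, a scikit-learn sketch of the frame-level SVM followed by the per-phoneme majority vote (the paper used LIBSVM directly; sklearn's SVC wraps the same library, and the C and gamma values below merely stand in for the paper's grid-search result):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def train_frame_classifier(X_train, y_train):
        """Scale features to zero mean/unit variance and train an RBF-kernel SVM."""
        clf = make_pipeline(StandardScaler(),
                            SVC(kernel='rbf', C=10.0, gamma=0.1))  # placeholder hyperparameters
        return clf.fit(X_train, y_train)

    def classify_phoneme(clf, phoneme_frames):
        """Classify each frame, then take a majority vote over the phoneme's frames."""
        labels = clf.predict(phoneme_frames)
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]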
III. RESULTS

A. Discrimination between voiced and unvoiced fricatives

In this experiment we used a total of 2620 phonemes (1141 unvoiced fricatives (/s/, /sh/, /f/ and /th/) and 1209 voiced fricatives (/z/, /zh/, /v/ and /dh/)). The LIBSVM was trained on a subset of these phonemes (30%). Using 50 cross-validation rounds, an average identification rate of 85.2% was obtained (Tables I and II). Interestingly, most of the unvoiced fricatives were correctly detected (92%), compared to the voiced fricatives (73%). This is probably due to the fact that many of the fricatives in TIMIT which are transcribed as /z/ or /dh/ are actually voiceless, i.e. pronounced without phonation. This was also found in [7].

TABLE I. CONFUSION MATRIX FOR VOICED/UNVOICED DISCRIMINATION

Detected as:  voiced  unvoiced
voiced        73%     27%
unvoiced      8%      92%

TABLE II. DETAILED ANALYSIS FOR VOICED/UNVOICED DISCRIMINATION

Detected as:  voiced  unvoiced
/sh/          0.7%    99.3%
/s/           3.8%    96.2%
/f/           5.5%    94.5%
/th/          15.0%   85.0%
/z/           39.4%   60.6%
/v/           88.8%   11.2%
/dh/          66.0%   34.0%
/zh/          40.9%   59.1%
B. Classification of unvoiced fricatives

In this experiment we aimed at detecting the place of articulation of the unvoiced fricatives, i.e., we classified this group into the individual phonemes /s/, /sh/, /f/ and /th/. A test set of 11,219 phonemes, which included almost all unvoiced fricatives of TIMIT, was used for evaluation. A cross-validation procedure of 100 rounds was applied, resulting in an average correct identification of 86.9%. The results are summarized in Table III. The discrimination accuracy of the sibilants was about 90% (91.6% for /sh/ and 88.7% for /s/), and that of the non-sibilants was almost 80%.

TABLE III. CONFUSION MATRIX OF CORRECT IDENTIFICATION OF UNVOICED FRICATIVES FOR 11848 PHONEMES.

Detected as:  /s/    /f/    /th/   /sh/
/s/           88.7%  0.5%   3.3%   7.5%
/f/           0.6%   80.0%  17.0%  1.7%
/th/          4.6%   14.8%  77.9%  2.6%
/sh/          4.3%   3.9%   0.2%   91.6%

In another experiment, the discrimination between unvoiced sibilants (/s/, /sh/) and non-sibilants (/f/, /th/) was tested. The evaluation included 8340 sibilants and 2879 non-sibilants. The results are shown in Table IV. As can be seen, the separation between these groups produced highly accurate results: 99.0% of the sibilants and 96.2% of the non-sibilants were correctly identified. In addition, the results of the discrimination within these two groups, the sibilant group and the non-sibilant group, are shown in Table V and Table VI, respectively. It can be seen that a very high identification rate was achieved for the classification of unvoiced sibilants into /s/ and /sh/, while the rate for the non-sibilants was somewhat lower.

TABLE IV. CONFUSION MATRIX OF THE CORRECT IDENTIFICATION (%) BETWEEN SIBILANTS {/s/, /sh/} AND NON-SIBILANTS {/f/, /th/}

Detected as:   sibilants  non-sibilants
{/s/, /sh/}    99.0%      1.0%
{/f/, /th/}    3.8%       96.2%

TABLE V. CONFUSION MATRIX FOR DISCRIMINATION WITHIN SIBILANTS

Detected as:  /s/    /sh/
/s/           99.0%  0.1%
/sh/          2.9%   97.1%

TABLE VI. CONFUSION MATRIX FOR DISCRIMINATION BETWEEN NON-SIBILANTS

Detected as:  /f/    /th/
/f/           85.6%  14.4%
/th/          17.8%  82.2%

C. Classification of voiced fricatives

A total of 7585 phonemes were used (/v/ - 1855, /dh/ - 2462, /z/ - 3100 and /zh/ - 168), of which 3% of the phonemes were used for training the SVM. The average correct identification rate in 100 cross-validation rounds was 80%, while the rates for /z/, /zh/, /v/ and /dh/ were 86.3%, 85.9%, 76.4% and 73.1%, respectively (Table VII).

TABLE VII. CONFUSION MATRIX FOR VOICED FRICATIVES.

Detected as:  /z/    /v/    /dh/   /zh/
/z/           86.3%  1.0%   2.6%   10.1%
/v/           0.6%   76.4%  21.7%  1.3%
/dh/          3.5%   22.5%  73.1%  0.9%
/zh/          9.8%   3.0%   1.3%   85.9%
Discriminating the voiced sibilants {/z/, /zh/} and non-sibilants {/v/, /dh/}, using 6% of the phonemes for training and the remainder for testing, produced a correct identification of 96.2% (95.8% for sibilants and 96.6% for non-sibilants, see Table VIII), using 3325 sibilant phonemes and 4172 non-sibilants.

TABLE VIII. CONFUSION MATRIX FOR VOICED SIBILANTS VERSUS NON-SIBILANTS.

Detected as:   {/z/, /zh/}  {/v/, /dh/}
{/z/, /zh/}    95.8%        4.2%
{/v/, /dh/}    3.4%         96.6%

Further classification of the voiced fricatives, where the non-sibilants are grouped as one phoneme, results in an overall accuracy of 93%.

IV. CONCLUSIONS

The aim of this study was to evaluate different features for fricative discrimination. We examined multiple features, from which a small subset was selected for the classification of various groups of fricatives, using a floating-search procedure. Several textural features which were not used previously for this task are presented, for example the lacunarity of the signal and the moments of the zero-crossing intervals. A support vector machine was trained for the evaluation. A large set of fricative phonemes, which included more than 18,000 fricatives, was used. The fricatives were extracted from the TIMIT database from texts uttered by more than 100 female and male speakers, from eight dialect regions. High correct identification rates were achieved for unvoiced fricatives (87%) and voiced fricatives (80%), despite the fact that no manual exclusion was performed, and that individual non-sibilants were used in the classification. Very high identification rates were achieved for the classification between sibilants and non-sibilants, for both voiced and unvoiced fricatives. The results suggest that the features used here are highly discriminative, and that the procedure could be utilized in technologies for the hearing impaired, for example by applying differential processing to the classified fricatives, to improve their perception.

REFERENCES

[1] D. Ruinskiy and Y. Lavner, "An Effective Algorithm for Automatic Detection and Exact Demarcation of Breath Sounds in Speech and Song Signals," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 3, pp. 838-850, 2007.
[2] D. Bauer, A. Plinge, and M. Finke, "Selective Phoneme Spotting for Realization of an /s, z, C, t/ Transposer," in Computers Helping People with Special Needs, K. Miesenberger, J. Klaus, and W. Zagler, Eds. Springer Berlin Heidelberg, 2002, pp. 153-161 [Online]. Available: http://link.springer.com/chapter/10.1007/3-540-45491-8_32. [Accessed: 04-Jul-2013]
[3] M. Wolters, "State of the Art Speech Processing for Broadcasting," presented at the Audio Engineering Society Convention 106, 1999 [Online]. Available: http://www.aes.org/e-lib/browse.cfm?elib=8303. [Accessed: 04-Jul-2013]
[4] R. Cohen, O. Ierushalmi, Y. Lavner, M. Genussov, and A. Steiner, "Manipulation of Unvoiced Fricatives for Improved Discrimination by Hearing Impaired," presented at IberSpeech, Madrid, Spain, 2012.
[5] D. Ruinskiy, N. Dadush, and Y. Lavner, "Spectral and textural feature-based system for automatic detection of fricatives and affricates," in 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2010, pp. 000771-000775.
[6] P. Ladefoged and I. Maddieson, The Sounds of the World's Languages, 1996.
[7] A. M. A. Ali, J. Van der Spiegel, and P. Mueller, "Acoustic-phonetic features for the automatic classification of fricatives," J. Acoust. Soc. Am., vol. 109, no. 5, pp. 2217-2235, 2001.
[8] H. Fu, R. D. Rodman, D. F. McAllister, D. L. Bitzer, and B. Xu, "Classification of Voiceless Fricatives through Spectral Moments," in Proceedings of the 5th International Conference on Information Systems, Analysis, and Synthesis (ISAS'99), Skokie, IL: International Institute of Informatics and Systemics, 1999, pp. 307-311.
[9] G. Hu and D. Wang, "Separation of Fricatives and Affricates," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005, vol. 1, pp. 1101-1104.
[10] M. Genussov, Y. Lavner, and I. Cohen, "Classification of Unvoiced Fricative Phonemes using Geometric Methods," in Proc. 12th International Workshop on Acoustic Echo and Noise Control, IWAENC-2010, Tel-Aviv, Israel, 2010.
[11] J.-W. Lee, J.-Y. Choi, and H.-G. Kang, "Classification of Fricatives Using Feature Extrapolation of Acoustic-Phonetic Features in Telephone Speech," in Twelfth Annual Conference of the International Speech Communication Association, INTERSPEECH-2011, Florence, Italy, 2011, pp. 1261-1264.
[12] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic Synchronization System Between Musical Audio Signals and Lyrics," IEEE J. Sel. Top. Signal Process., vol. 5, no. 6, pp. 1252-1261, 2011.
[13] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Commun., vol. 9, no. 4, pp. 351-356, Aug. 1990.
[14] SIPL, "MatlabADT - Audio Database Toolbox." [Online]. Available: http://sipl.technion.ac.il/Info/Downloads_MatlabADT_e.shtml
[15] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recognit. Lett., vol. 15, no. 11, pp. 1119-1125, Nov. 1994.
[16] L. J. Raphael, G. J. Borden, and K. S. Harris, Speech Science Primer: Physiology, Acoustics, and Perception of Speech. Lippincott Williams & Wilkins, 2007.
[17] D. Ellis, "Gammatone-like spectrograms," 2009 [Online]. Available: http://www.ee.columbia.edu/~dpwe/resources/matlab/gammatonegram/
[18] L. J. Hadjileontiadis, "A Texture-Based Classification of Crackles and Squawks Using Lacunarity," IEEE Trans. Biomed. Eng., vol. 56, no. 3, pp. 718-732, 2009.
[19] R. E. Plotnick, R. H. Gardner, W. W. Hargrove, K. Prestegaard, and M. Perlmutter, "Lacunarity analysis: A general technique for the analysis of spatial patterns," Phys. Rev. E, vol. 53, no. 5, pp. 5461-5468, May 1996.
[20] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1-27:27, May 2011.
[21] R.-E. Fan, P.-H. Chen, and C.-J. Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines," J. Mach. Learn. Res., vol. 6, pp. 1889-1918, Dec. 2005.

