Springer SPECOM Parismita
1 Introduction
Over the years, linguistic rhythm research has been focused on the idea that
acoustic measures of duration of vowels and consonants can be assessed in order
to classify languages according to rhythmic templates. Based on the rhythm
class hypothesis, languages can be classified as syllable-timed, stress-timed, and
mora-timed [1,7,18,21,22]. Rhythm metrics are formulas that quantify vocalic
and consonantal variability and are used in typological studies to classify
languages rhythmically. The stress-, syllable- and mora-timing distinction is
quantified from the durations of vowels and consonants, rather than from
syllables or stress feet. It is reported that Spanish (syllable-timed) has
less complex consonant clusters and less vowel reduction than English
(stress-timed) [5]. Low et al. demonstrated that identifying vowel/consonant
boundaries is rather straightforward compared to applying syllabification
rules, which differ across the world's languages [16].
In 1999, Ramus et al. carried out a preliminary attempt to quantify conso-
nantal and vocalic variability by proposing the standard deviation of vocalic and
© Springer Nature Switzerland AG 2022
S. R. M. Prasanna et al. (Eds.): SPECOM 2022, LNAI 13721, pp. 201–213, 2022.
https://doi.org/10.1007/978-3-031-20980-2_18
202 P. Gogoi et al.
consonantal interval duration (‘∆V’ and ‘∆C’ respectively) [25]. The percentage
of utterance duration that is vocalic rather than consonantal is termed %V.
Rhythm classification in order of increasing %V was performed on previously
classified languages (Dutch/English/Polish, Catalan/French/Italian/Spanish, and
Japanese), using a combination of ∆C and %V to place them statistically on the
rhythm continuum. The pairwise variability indices nPVI and rPVI (pairwise
comparisons of successive vocalic and intervocalic intervals) were later
introduced by Grabe and Low [11,16]. PVIs capture syntagmatic distinctions over
an utterance by averaging vocalic/consonantal durational differences [19]. Low
et al. distinguished a Singaporean and a British dialect of English using
PVI-based measures [16]. Speech rate (SR) tends to show a high correlation with interval
duration measures based on variance. It is reported that with a slower rate,
lengthening of intervals takes place [6]. Coefficients of variation for consonantal
intervals (VarcoC: [6]) and vocalic intervals (VarcoV: [9]) are also explored to
implement speech rate normalization. At best, metrics such as VarcoV and %V
are approximate indicators of broad phonetic and phonotactic patterns [6].
Using normalized PVI (nPVI)-based metrics, interval durations were also
normalized to account for speech rate variation [16]. Many authors have
presented newer methods of quantifying rhythmic distinctions [15,27].
Japanese) with results that were comparable to Ramus et al. (1999) [25]. In
another approach, acoustic waveforms were annotated using knowledge of loudness
and periodicity in order to calculate units relevant for rhythmic
identification [17]. Fifteen rhythm metrics, including %V, ∆C, and PVIs, were
measured to rhythmically classify Russian, Greek, Taiwan Mandarin, Southern
British English, and French. The accuracy rates were found to be in the range
of 34% to 43% above chance. Another method utilized an automatic vowel
detection algorithm [27], originally applied in a language identification (LID)
task [8]. The methodology could classify the vowel and non-vowel portions and
parse the utterance string into pseudo-syllables. Total consonant cluster
duration, total vowel duration, and complexity of the consonantal cluster
(i.e., the number of consonants in the cluster) were computed from each
pseudo-syllable. Then, 7 languages (English, German, Mandarin, French, Italian,
Spanish, and Japanese) were classified using Gaussian Mixture Models (GMM),
with mean correct rhythmic accuracy from 80% for mora-timed to 92% for
stress-timed languages. Another work has been reported to demonstrate a rhythm
metric-based LID approach, using GMM-based learning [14] for a multi-lingual
system.
Our current work aims at exploring rhythmic variability in the Mising lan-
guage of the Eastern Tani sub-group of the Tibeto-Burman language family
in North-East India [24]. Most of the Tibeto-Burman (TB) languages are yet
to be fully documented or scientifically described. Eight Mising dialects are tra-
ditionally recognized: Pagro, Delu, Sayang, Oyan, Moying, Dambuk, Somuwa,
and Samuguria [28]. Despite the large number of native speakers of different
dialects, the phonetics of prosody in Mising has not previously been
investigated. There is next to no work on rhythm analysis of Mising speakers in
dialectal data. Therefore, the rhythmic characteristics of the dialects, and
the differences among them, are largely unknown.
We focus on identifying the rhythm class of Mising as per the rhythm class
hypothesis in our preliminary attempt, using automatic computation of rhythm
measures. This investigation will shed light on many questions about the stress,
rhythm, and prosodic behavior of Tibeto-Burman languages, which remain
important and open. Considering the limitations pointed out
earlier regarding the manual assessment of rhythm measures, we employ an auto-
mated method of vowel onset (VOP) and offset points (VEP) detection from
acoustic signals for measuring the interval durations and utilize these values to
compute rhythm measures [23]. The rhythm measures are calculated from conso-
nantal and vocalic intervals derived from the VOP and VEP detection method.
This automated method will be helpful compared to transcription-based forced
alignment techniques, where linguistic knowledge is a must for processing the
audio files. An LID approach is proposed for distinguishing Mising from
Assamese, using the speech rhythm measures and speech rate as the feature set.
This work studies the rhythm characteristics of two Mising dialects, namely
Pagro and Delu, using spontaneous speech data. Assamese, an Indo-European
language, is spoken by about 15 million people in the state of Assam. Assamese
data is also considered for the language identification task with Mising. The
rhythm measures of Assamese are comparable to those of mora-timed languages,
such as Japanese [7].
To summarize, the following are the contributions of the present paper-
2 Database Preparation
2.1 Speakers
Fig. 1. Box plots of the rhythm measures for two dialects of Mising and Assamese.
nine Assamese (A) speakers. Assamese has been explored as the control language
to compare the prosody of two broad language families, namely Indo-European
(Assamese) and Tibeto-Burman (Mising). None of the 28
speakers reported any speech disorders during the recording. A detailed descrip-
tion of the database can be found in Table 1.
2.2 Materials
The principal elicitation method used in this study is spontaneous speech. Five
topics were provided to the speakers of both languages for recording. The topics
were mainly related to their daily livelihoods and ethnocultural practices of the
Assamese and Mising communities, respectively. The five topics for Assamese
were: the Bihu festival, an introduction to the speaker's own village, weaving
methods of Assamese dresses, the Assamese community, and ethnic food
preparation methods. The five topics for Mising were: the Ali-Aye-Ligang
festival, an introduction to the speaker's own village, weaving methods of
Mising dresses, the Mising community, and ethnic food preparation methods. The
duration of the spontaneous speech recordings ranged from 1 min to 4 min.
Table 2. Overall rhythm measures for Assamese (A), Pagro (P) and Delu (D).
Rhythm measure   A (µ ± σ)        P (µ ± σ)         D (µ ± σ)
%V               43.04 ± 7.15     38.37 ± 8.21      41.51 ± 8.13
∆V               0.05 ± 0.02      0.086 ± 0.02      0.093 ± 0.029
∆NV              0.17 ± 0.08      0.292 ± 0.20      0.26 ± 0.14
Varco-V          52.25 ± 16.30    68.86 ± 14.10     70.12 ± 13.79
Varco-NV         98.61 ± 28.66    132.55 ± 29.25    126.06 ± 31.49
nPVI-V           46.19 ± 10.48    59.34 ± 8.96      59.86 ± 9.20
nPVI-NV          66.58 ± 14.63    81.10 ± 14.55     78.02 ± 12.96
SR               3.45 ± 0.72      2.89 ± 0.74       2.75 ± 0.63
3 Methodology
This section discusses the methodology used to automatically derive the rhythm
measures by segmenting the speech signal into vowel (V) regions and non-vowel
(NV) regions. Here, the V-region includes the vowels present in the speech
signal, whereas the NV-region corresponds to the consonants and pauses. The
segmentation of V and NV regions is performed using the algorithm proposed in
[23]. This method uses excitation source information, such as the
zero-frequency filtered signal and the Hilbert envelope of the linear
prediction residual, to locate the VOP and VEP in a speech signal [23]. We
consider the speech region from one VOP to the adjacent VEP as a V-region,
whereas the region from one VEP to the next VOP is considered an NV-region. For
a more detailed description, the interested reader can refer to the original
paper [23]. The steps to derive the rhythm measures automatically are
summarized below.
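The interval definition above can be illustrated with a minimal sketch. It does not implement the VOP/VEP detection algorithm of [23]; it assumes the detected onset and offset times (in seconds) are already available as two equal-length lists, and the function name `intervals_from_vop_vep` is hypothetical.

```python
from typing import List, Tuple


def intervals_from_vop_vep(
    vops: List[float], veps: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (V durations, NV durations) from paired VOP/VEP times.

    Assumption: vops[i] and veps[i] delimit the i-th vowel region and
    both lists are sorted in time. Detection itself is out of scope.
    """
    if len(vops) != len(veps):
        raise ValueError("expected exactly one VEP per VOP")

    v_durations: List[float] = []
    nv_durations: List[float] = []
    for i, (onset, offset) in enumerate(zip(vops, veps)):
        if offset <= onset:
            raise ValueError("each VEP must come after its VOP")
        # V-region: from one VOP to the adjacent VEP.
        v_durations.append(offset - onset)
        if i + 1 < len(vops):
            next_onset = vops[i + 1]
            if next_onset < offset:
                raise ValueError("next VOP precedes the current VEP")
            # NV-region: from one VEP to the next VOP.
            nv_durations.append(next_onset - offset)
    return v_durations, nv_durations


if __name__ == "__main__":
    v, nv = intervals_from_vop_vep([0.10, 0.50, 1.00], [0.30, 0.70, 1.25])
    print(v, nv)
```

With N detected vowels this yields N V-regions and N-1 NV-regions; any speech or silence before the first VOP and after the last VEP is ignored in this sketch, which is a simplifying choice rather than something stated in the paper.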
To analyze rhythm, five interval measures, namely %V, ∆V, ∆NV, Varco-V and
Varco-NV, are calculated. %V is the percentage of the utterance duration
occupied by V-regions. ∆NV is the standard deviation of the duration of
NV-regions, and ∆V is the standard deviation of the duration of V-regions [25].
Varco-NV is defined as the standard deviation of NV interval duration (∆NV)
expressed as a percentage of the average duration of NV-regions (mean NV).
Similarly, Varco-V is calculated from the V-regions. Two Pairwise Variability
Index measures, nPVI-V and nPVI-NV, are evaluated following [19]. nPVI-V is the
rate-normalized measure of the durational variation between consecutive
V-regions, and nPVI-NV is the corresponding measure for consecutive NV-regions.
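The definitions above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the population standard deviation is used (the text does not say whether the sample or population form was applied), %V is computed over the summed V and NV durations, and the function names are hypothetical. The nPVI follows the usual formulation, 100 times the mean of |d_k - d_(k+1)| divided by the mean of the two durations.

```python
import statistics
from typing import Dict, List


def npvi(durations: List[float]) -> float:
    """Normalized pairwise variability index over successive intervals."""
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    terms = [
        abs(a - b) / ((a + b) / 2.0)
        for a, b in zip(durations[:-1], durations[1:])
    ]
    return 100.0 * sum(terms) / len(terms)


def varco(durations: List[float]) -> float:
    """Standard deviation as a percentage of the mean duration."""
    # Assumption: population standard deviation (pstdev).
    return 100.0 * statistics.pstdev(durations) / statistics.mean(durations)


def rhythm_measures(v: List[float], nv: List[float]) -> Dict[str, float]:
    """Compute the seven duration-based measures from V and NV durations."""
    total = sum(v) + sum(nv)
    return {
        "%V": 100.0 * sum(v) / total,
        "deltaV": statistics.pstdev(v),
        "deltaNV": statistics.pstdev(nv),
        "Varco-V": varco(v),
        "Varco-NV": varco(nv),
        "nPVI-V": npvi(v),
        "nPVI-NV": npvi(nv),
    }


if __name__ == "__main__":
    # Toy durations in seconds, chosen only for illustration.
    for name, value in rhythm_measures(
        [0.1, 0.2, 0.1, 0.2], [0.3, 0.3, 0.3]
    ).items():
        print(f"{name}: {value:.2f}")
```

Because Varco and nPVI divide by a local or global mean duration, they are less sensitive to overall speech rate than the raw ∆ measures, which is the motivation for the rate normalization described in the text.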
Automatic Rhythm and Speech Rate Analysis of Mising Spontaneous Speech 207
manual annotation-based computation for five Assamese varieties [7]. Pagro and
Delu are placed closer to British English. From this plot, stress-timing is
found to be dominant in the two varieties of Mising. It is believed that in
stress-timed languages, codas and consonant clusters contribute to a greater
consonantal portion of the signal [26]. Mising has seven long and seven short
vowels, which is a marker of rhythm [28]. Additionally, diphthongization is
more robust on long vowels [20], and the presence of long vowels in a language
is connected with greater durational variability. Greater durational
variability is reported in stress-timed languages due to vowel reduction, and
is measured by metrics such as PVI, ∆V, and Varco-V. Mora-timed languages
generally exhibit a simpler syllable structure. Researchers have paid
significant attention to the possibility of an interaction between
suprasegmental rhythm and vocabulary systems. Many works have concentrated on
criteria of segmental phonology to link to the rhythm class hypothesis [26].
In the process of capturing the hypothesized rhythm class on the language
continuum, the rate-normalized vowel-based metrics Varco-V and nPVI-V are found
to be the most reliable. Hence, Varco-V values are plotted on the vertical axis
against nPVI-V values on the horizontal axis, as shown in Fig. 3. The PVI
profiles provide acoustic evidence for rhythmic differences between English and
Mising (Pagro, Delu) on the one hand, and Spanish on the other. Mora-timed
Japanese and Assamese pattern between the stress-timed and syllable-timed
languages. Stress-timed languages are said to exhibit more vocalic variability
(higher nPVI-V) than syllable-timed languages, related to vowel quality.
The findings of the present method seem to be consistent with the pre-
vious method reported in [7]. In Fig. 4, RM values computed using proposed
Fig. 4. Comparison of rhythm measure (RM) values computed using the proposed
automatic method for spontaneous speech recorded from speakers from Upper Assam
(Proposed (UA-Spon)) and Praat-based manual annotation (Manual (JOR-Read) and
Manual (TIN-Read)) reported in [7] for read speech recorded from speakers from
Jorhat and Tinsukia, two districts in the Upper Assam region.
Table 3. Wald χ2 tests on LME models for rhythm measures (RM) and speech rate
(SR) for Assamese and Mising.
This work explores two machine learning models, viz. SVM and RF, to investigate
the effectiveness of automatically computed rhythm measures and speech rate
for classifying Mising (Pagro and Delu combined) and Assamese. The SVM maps the
input features into a high-dimensional space so that the classes become
linearly separable [4]. RF is an ensemble of decision trees, and its prediction
for an input is the class voted for by most trees [13].
The models are trained using 4-fold cross-validation in a speaker-independent
manner. In each iteration, speech data from three folds (approx. 80% of total
data) are used for training, and the remaining fold is used to evaluate the
performance. The training set is further divided into the actual training set,
which is used to train the model, and a development set, which is used to
optimize the model's hyperparameters. Thus, the models are evaluated four
times, and for each fold, accuracy and F1-score on the test set are used as
evaluation metrics. We use the grid-search method to tune the hyperparameters
of SVM, such as C and γ, and of RF, such as the number of trees. The results
Table 4. Classification results of the 4-fold cross validation for Assamese and Mising.
are noted in terms of the mean (µ) and standard deviation (σ) of the accuracy
and F1-score over the four folds. The speakers in the training set are not
included in the test set in any fold. Hence, the classification results are
reported in a speaker-independent manner. The classification results of the
4-fold cross-validation are given in Table 4. The models are developed for
three different combinations of features: RM, RM and SR, and RM and SR
excluding %V. An average accuracy of 81.48% and an average F1-score of 81.12%
are observed for the RM-based SVM model. Similar performance is seen for the RF
classifier. Including SR with the RM provides around a 1% improvement for the
SVM. It is also found that excluding the %V feature improves the SVM-based
system performance. The table shows that RM and SR computed automatically using
VOP and VEP detection can be utilized in classifying Mising and Assamese.
Figure 5 shows the contribution of each feature in classifying Mising and
Assamese. Rhythm measures related to the vowel regions (except %V) are very
important in the classification. Further investigation needs to be carried out
to identify the possible cause of this trend.
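The speaker-independent fold construction described above can be sketched as follows. This is only an illustration: the paper does not state how speakers were assigned to folds, so the round-robin assignment over sorted speaker IDs, the function name `speaker_independent_folds`, and the toy speaker labels are all assumptions.

```python
from typing import Dict, List, Sequence, Tuple


def speaker_independent_folds(
    speaker_ids: Sequence[str], n_folds: int = 4
) -> List[Tuple[List[int], List[int]]]:
    """Split utterance indices into (train, test) pairs by speaker.

    Every utterance of a given speaker lands in the same fold, so no
    speaker appears in both the training and the test set of a fold.
    Assumption: speakers are assigned to folds round-robin.
    """
    speakers = sorted(set(speaker_ids))
    if len(speakers) < n_folds:
        raise ValueError("need at least as many speakers as folds")

    fold_of: Dict[str, int] = {
        spk: i % n_folds for i, spk in enumerate(speakers)
    }
    folds: List[Tuple[List[int], List[int]]] = []
    for k in range(n_folds):
        test = [i for i, s in enumerate(speaker_ids) if fold_of[s] == k]
        train = [i for i, s in enumerate(speaker_ids) if fold_of[s] != k]
        folds.append((train, test))
    return folds


if __name__ == "__main__":
    # One speaker label per utterance (toy data).
    ids = ["s1", "s1", "s2", "s3", "s3", "s4", "s5"]
    for train_idx, test_idx in speaker_independent_folds(ids, n_folds=4):
        print("train:", train_idx, "test:", test_idx)
```

Splitting by speaker rather than by utterance keeps the reported accuracy from being inflated by speaker-specific cues; the development set mentioned in the text would be carved out of each training portion in the same speaker-wise fashion.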
This paper discusses a methodology for computing the rhythm measures and
speech rate by automatically locating VOP and VEP from spontaneous speech.
This automated method relies on acoustic information alone, which is bene-
ficial compared to labour-consuming manual annotation and forced alignment
methods. We have analyzed the rhythm measures of Mising, a low-resource
language spoken in Assam, and performed a comparative study with Assamese, the
official language of Assam. From the analysis, it is found that Mising lies
towards the stress-timed end of the language continuum, while Assamese falls in
the mora-timed language category, consistent with previous studies [7]. A
significant difference is observed between Mising and Assamese for all the
measures except %V. However, no statistically significant difference is noted
between the two dialects of Mising, which can be seen from the eight feature
boxplots for Pagro and Delu. LID systems are designed using machine learning
models such as SVM and RF, considering combinations of rhythm measures and
speech rate. The SVM-based system with the 7-dimensional feature set appears to
provide the best performance, with an accuracy of 83.10% and an F1-score of
82.34%.
We have observed considerably large values of Varco-NV for Assamese and the
two Mising varieties. The value of Varco-NV for Assamese is high compared to
that reported in [7]. One possible reason is the inclusion of silence regions
in the NV region. Future work is planned to investigate this measure in more
detail. Moreover, the current work only considers two Mising dialects; hence,
rhythm analysis of other Mising dialects is also planned in future research to
investigate between- and within-speaker rhythmic variability.
References
1. Abercrombie, D.: Elements of General Phonetics. Edinburgh University Press,
Edinburgh, Scotland (1980)
2. Bates, D., Mächler, M., Bolker, B., Walker, S.: Fitting linear mixed-effects models
using lme4. J. Stat. Softw. 67(1), 1–48 (2015)
3. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (version 5.1.13)
(2009). http://www.praat.org
4. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
5. Dauer, R.M.: Stress-timing and syllable-timing reanalyzed. J. Phonet. 11(1), 51–62
(1983)
6. Dellwo, V., Wagner, P., Solé, M., Recasens, D., Romero, J.: Relations between
language rhythm and speech rate (2003)
7. Dihingia, L., Sarmah, P.: Rhythm and speaking rate in Assamese varieties. In:
Proceedings of 10th International Conference on Speech Prosody 2020, pp. 561–
565 (2020)
8. Farinas, J., Pellegrino, F.: Automatic rhythm modeling for language identifica-
tion. In: Seventh European Conference on Speech Communication and Technology
(2001)
9. Ferragne, E., Pellegrino, F.: Rhythm in read British English: interdialect variability.
In: Eighth International Conference on Spoken Language Processing (2004)
10. Galves, A., Garcia, J., Duarte, D., Galves, C.: Sonority as a basis for rhythmic
class discrimination. In: Speech Prosody 2002, International Conference (2002)
11. Grabe, E., Low, E.L.: Durational Variability in Speech and the Rhythm Class
Hypothesis, pp. 515–546. De Gruyter Mouton (2008)
12. Grenon, I., White, L.: Acquiring rhythm: a comparison of L1 and L2 speakers of
Canadian English and Japanese. In: Proceedings of the 32nd Boston University
Conference on Language Development, pp. 155–166. Citeseer (2008)
13. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference
on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
14. Kim, H., Park, J.S.: Automatic language identification using speech rhythm fea-
tures for multi-lingual speech recognition. Appl. Sci. 10(7) (2020). https://doi.org/
10.3390/app10072225, https://www.mdpi.com/2076-3417/10/7/2225
15. Lee, C.S., Todd, N.P.M.: Towards an auditory account of speech rhythm:
application of a model of the auditory ‘primal sketch’ to two multi-language
corpora. Cognition 93(3), 225–254 (2004)
16. Ling, L.E., Grabe, E., Nolan, F.: Quantitative characterizations of speech rhythm:
syllable-timing in Singapore English. Lang. Speech 43(4), 377–401 (2000). https://
doi.org/10.1177/00238309000430040301, PMID: 11419223
17. Loukina, A., Kochanski, G., Rosner, B., Keane, E., Shih, C.: Rhythm measures and
dimensions of durational variation in speech. J. Acoust. Soc. Am. 129(5), 3258–3270
(2011). https://doi.org/10.1121/1.3559709
18. Murty, L., Otake, T., Cutler, A.: Perceptual tests of rhythmic similarity: I. mora
rhythm. Lang. Speech 50(1), 77–99 (2007)
19. Nolan, F., Asu, E.L.: The pairwise variability index and coexisting rhythms in
language. Phonetica 66(1–2), 64–77 (2009)
20. Pegu, J.: Morpho-syntactic variation in the Pagro and Sa:jan dialects of the Mising
community. North East Indian Linguist. 3, 155–170 (2011)
21. Pike, K.: The Intonation of American English. University of Michigan Press, Ann
Arbor, MI, USA (1945)
22. Port, R.F., Dalby, J., O’Dell, M.: Evidence for mora timing in Japanese. J. Acoust.
Soc. Am. 81(5), 1574–1585 (1987)
23. Pradhan, G., Prasanna, S.R.M.: Speaker verification by vowel and nonvowel like
segmentation. IEEE Trans. Audio Speech Lang. Process. 21(4), 854–867 (2013)
24. Prasad, B.: Mising grammar. Mysore, Central Institute of Indian languages (CIIL)
Eds: Sastry and Abraham (1991)
25. Ramus, F., Nespor, M., Mehler, J.: Correlates of linguistic rhythm in the speech
signal. Cognition 75(1), AD3-AD30 (2000)
26. Rathcke, T.V., Smith, R.H.: Speech timing and linguistic rhythm: on the acoustic
bases of rhythm typologies. J. Acoust. Soc. Am. 137(5), 2834–2845 (2015). https://
doi.org/10.1121/1.4919322
27. Rouas, J.L., Farinas, J., Pellegrino, F., André-Obrecht, R.: Rhythmic unit extrac-
tion and modelling for automatic language identification. Speech Commun. 47(4),
436–456 (2005)
28. Taid, T.: A short note on Mising phonology. Linguistics of the Tibeto-Burman Area
10.1 (1987)
29. White, L., Mattys, S.: Rhythmic Typology and Variation in First and Second
Languages, pp. 237–257 (2007). https://doi.org/10.1075/cilt.282.16whi
30. Wiget, L., White, L., Schuppler, B., Grenon, I., Rauch, O., Mattys, S.L.: How
stable are acoustic metrics of contrastive speech rhythm? J. Acoust. Soc. Am.
127(3), 1559–1569 (2010)