Automatic Rhythm and Speech Rate Analysis of Mising Spontaneous Speech

Parismita Gogoi (1,3; corresponding author), Priyankoo Sarmah (1), and S. R. M. Prasanna (2)

(1) Indian Institute of Technology Guwahati, Guwahati 781039, India
    {parismitagogoi,priyankoo}@iitg.ac.in
(2) Indian Institute of Technology Dharwad, Dharwad 580011, India
    prasanna@iitdh.ac.in
(3) DUIET, Dibrugarh University, Dibrugarh 786004, India

Abstract. The objective of this study is to analyse the rhythm measures and
speech rate of Mising, a low-resource language spoken in the north-eastern (NE)
region of India. In this work, two Mising dialects, Pagro and Delu, are
considered along with Assamese to study their rhythmic characteristics
in spontaneous speech. Rhythm metrics are generally computed from annotated
speech regions. However, for spontaneous speech, hand annotation may be
difficult and time-consuming. Therefore, this work explores a vowel onset point
and offset point detection algorithm to automate the rhythm measure
calculation. Results show that the Mising varieties are stress-timed, as
compared to the mora-timed Assamese language. Finally, automatically computed
rhythm measures and speech rate are explored for classifying Assamese and
Mising using machine learning models.

Keywords: Rhythm · Speech rate · SVM · Random forest · Mising

1 Introduction
Over the years, linguistic rhythm research has been focused on the idea that
acoustic measures of duration of vowels and consonants can be assessed in order
to classify languages according to rhythmic templates. Based on the rhythm
class hypothesis, languages can be classified as syllable-timed, stress-timed, and
mora-timed [1,7,18,21,22]. Rhythm metrics are formulas that quantify vocalic
and consonantal variability and are used in typological studies to classify
languages rhythmically. The stress-, syllable- and mora-timing distinction in
languages is quantified based on the durations of vowels and consonants, rather
than syllables or stress feet. It is reported that Spanish (syllable-timed) has
less complex consonant clusters and less vowel reduction as compared to English
(stress-timed) [5]. Low et al. demonstrated that identification of vowel/consonant
boundaries is rather straightforward compared to syllabification rules that differ
in world languages [16].

© Springer Nature Switzerland AG 2022
S. R. M. Prasanna et al. (Eds.): SPECOM 2022, LNAI 13721, pp. 201–213, 2022.
https://doi.org/10.1007/978-3-031-20980-2_18

In 1999, Ramus et al. carried out a preliminary attempt to quantify consonantal
and vocalic variability by proposing the standard deviation of vocalic and
consonantal interval duration (‘∆V’ and ‘∆C’ respectively) [25]. The percentage
of utterance duration that is vocalic rather than consonantal is termed as (%V).
Rhythm classification in increasing %V order was performed on previously clas-
sified languages (Dutch/English/Polish, Catalan/French/Italian/Spanish, and
Japanese) to reflect statistically in the rhythm continuum, using a combination
of ∆C and %V. The pairwise variability indices nPVI and rPVI (pairwise com-
parisons of successive vocalic and intervocalic intervals) were later introduced
by Grabe and Low [11,16]. PVIs capture syntagmatic distinction over an utter-
ance by averaging vocalic/consonantal durational differences [19]. Low et al.
identified a Singaporean and a British dialect of English based on PVI-based
measures [16]. Speech rate (SR) tends to show a high correlation with interval
duration measures based on variance. It is reported that with a slower rate,
lengthening of intervals takes place [6]. Coefficients of variation for consonantal
intervals (VarcoC: [6]) and vocalic intervals (VarcoV: [9]) are also explored to
implement speech rate normalization. At best, metrics such as VarcoV and %V
are approximate indicators of broad phonetic and phonotactic patterns [6].
Using normalized PVI (nPVI)-based metrics, interval durations were also
normalized to account for speech rate variation [16]. Many authors have
presented newer methods of quantifying rhythmic distinctions [15,27].

1.1 Previous Work


The major limitation of working with large speech file sizes and the mixture of
elicitation methods such as story reading, spontaneous speech, and reading a
set of sentences is the laborious nature of the manual measurement of segment
duration. Also, there are some language-specific biases in applying segmentation
criteria [17]. These biases also depend on the annotator’s expertise in the
language. Several measures for rhythm analysis are conventionally derived by
annotating each phonetic unit with time-stamps and interval durations using a
speech tool, such as Praat [3], which may be time-consuming and cumbersome.
Previously, automated approaches were adopted for calculating segment dura-
tions using data-trained models for the recognition and forced alignment [30].
However, such methods are not very successful as forced alignment is available
only for a few languages and they address purely acoustic-based automatic anno-
tation of speech files [17]. A completely automated sonority estimation technique
based on spectrogram was employed [10], which mapped spans of the speech
file indicated by the entropy from one time-stamp to the next. This method
also required pre-labeling the acoustic signal as Cs and Vs before final
measurements. This method of automatic sonority estimation could rhythmically
classify eight languages (English, Dutch, Polish, Spanish, Italian, French,
Catalan, and Japanese) with results that were comparable to Ramus et al.
(1999) [25]. In another approach, acoustic waveforms were annotated using
knowledge of loudness and periodicity in order to calculate units relevant for
rhythmic identification [17]. Fifteen rhythm metrics, including %V, ∆C, and
PVIs, were measured to rhythmically classify Russian, Greek, Taiwan Mandarin,
Southern British English, and French. The accuracy rates were found to be in
the range of 34% to 43% above chance. Another method utilized an automatic
vowel detection algorithm [27], originally applied in a language identification
task [8]. The methodology could classify the vowel and non-vowel portions and
parse the utterance string into pseudo syllables. Another work has been reported
to demonstrate a rhythm metric-based LID approach, using GMM based learn-
ing [14] for a multi-lingual system. Total consonant cluster duration, total vowel
duration, and complexity of the consonantal cluster (i.e., number of the conso-
nants in the cluster) were computed from each pseudo-syllable. Then, 7 languages
(English, German, Mandarin, French, Italian, Spanish, and Japanese) were
classified with mean correct rhythmic accuracy from 80% for mora-timed to 92% for
stress-timed languages using Gaussian Mixture Models (GMM).

1.2 Motivation and Contribution

Our current work aims at exploring rhythmic variability in the Mising lan-
guage of the Eastern Tani sub-group of the Tibeto-Burman language family
in North-East India [24]. Most of the Tibeto-Burman (TB) languages are yet
to be fully documented or scientifically described. Eight Mising dialects are tra-
ditionally recognized: Pagro, Delu, Sayang, Oyan, Moying, Dambuk, Somuwa,
and Samuguria [28]. Despite the large number of native speakers of different
dialects, the phonetics of prosody in Mising has not previously been
investigated. There is next to no work on rhythm analysis of dialectal Mising
data. Therefore, the rhythmic characteristics of the dialects, and the
differences among them, are largely unknown.
We focus on identifying the rhythm class of Mising as per the rhythm class
hypothesis in our preliminary attempt, using automatic computation of rhythm
measures. This investigation will shed light on questions about the stress,
rhythm, and prosodic behavior of Tibeto-Burman languages, which remain
important and open. Considering the limitations pointed out
earlier regarding the manual assessment of rhythm measures, we employ an auto-
mated method of vowel onset (VOP) and offset points (VEP) detection from
acoustic signals for measuring the interval durations and utilize these values to
compute rhythm measures [23]. The rhythm measures are calculated from conso-
nantal and vocalic intervals derived from the VOP and VEP detection method.
This automated method will be helpful compared to transcription-based forced
alignment techniques, where linguistic knowledge is a must for processing the
audio files.

Table 1. Database details.

Language/Dialect     Male  Female  Age (µ ± σ)          #tokens
Assamese (A)         3     6       22.55 ± 1.25         392
Mising - Pagro (P)   4     6       34.31 ± 7.69         193
Mising - Delu (D)    2     7       30.07 ± 11.14        209
Total                9     19      28 (total speakers)  794

A language identification (LID) approach is proposed between the Mising and
Assamese languages using the speech rhythm measures and speech rate as a
feature set. This
work studies the rhythm characteristics of two Mising dialects, namely Pagro
and Delu, using spontaneous speech data. Assamese, spoken by about 15 million
people in the state of Assam, is an Indo-European language. Assamese data is
also considered for the language identification task with Mising. The rhythm
measures of Assamese are comparable to those of mora-timed languages, such as
Japanese [7].
To summarize, the contributions of the present paper are as follows:

1. Mising, which has hitherto not been analyzed for rhythm, is classified as
   per the rhythm class hypothesis.
2. Automatic annotation based on VOP and VEP is carried out for calculating
   interval durations in rhythm measure calculation.
3. Language identification of two under-resourced languages, viz. Mising vs.
   Assamese, has been conducted using rhythm measures and speech rate.

2 Database Preparation

In order to study the rhythm of Mising, spontaneous speech of native speakers
of Pagro (P) and Delu (D) was recorded. Assamese spontaneous speech was
collected from Assamese (A) speakers residing in the upper Assam region. All
recordings were made in a noiseless environment using a Zoom H1n recorder, at a
sampling frequency of 44.1 kHz with 16-bit resolution, in .wav format. The
recorded data was segmented into short utterances in Praat 6.0.35 [3].

2.1 Speakers

In this work, a speech database is prepared from a total of 28 speakers belonging
to the Assamese (A), Pagro (P), and Delu (D) groups. Data has been collected
from ten Pagro Mising speakers, who reside mostly near the Jonai area of the
Dhemaji district. The Delu Mising speakers reside near the banks of the
Burhidihing river in the Rajabari area of Sivasagar district. The recordings
were completed between January and December 2020. Assamese speech has been
collected from nine Assamese (A) speakers. Assamese is used as the control
language to compare the prosody of two broad language families, namely
Indo-European (Assamese) and Tibeto-Burman (Mising). None of the 28
speakers reported any speech disorders during the recording. A detailed
description of the database can be found in Table 1.

Fig. 1. Box plots of the rhythm measures for two dialects of Mising and Assamese.

2.2 Materials

The principal elicitation method used in this study is spontaneous speech. Five
topics were provided to the speakers of both languages for recording. The topics
were mainly related to the daily livelihoods and ethnocultural practices of the
Assamese and Mising communities, respectively. The five topics for Assamese
were: the Bihu festival, introduction to own village, weaving methods of
Assamese dresses, the Assamese community, and ethnic food preparation methods.
The five topics for Mising were: the Ali-Aye-Ligang festival, introduction to
own village, weaving methods of Mising dresses, the Mising community, and ethnic
food preparation methods. The duration of the spontaneous speech recordings
ranged from 1 min to 4 min.

Table 2. Overall rhythm measures for Assamese (A), Pagro (P) and Delu (D).

Rhythm measures A (µ ± σ) P (µ ± σ) D (µ ± σ)
%V 43.04 ± 7.15 38.37 ± 8.21 41.51 ± 8.13
∆V 0.05 ± 0.02 0.086 ± 0.02 0.093 ± 0.029
∆NV 0.17 ± 0.08 0.292 ± 0.20 0.26 ± 0.14
Varco-V 52.25 ± 16.30 68.86 ± 14.10 70.12 ± 13.79
Varco-NV 98.61 ± 28.66 132.55 ± 29.25 126.06 ± 31.49
nPVI-V 46.19 ± 10.48 59.34 ± 8.96 59.86 ± 9.20
nPVI-NV 66.58 ± 14.63 81.10 ± 14.55 78.02 ± 12.96
SR 3.45 ± 0.72 2.89 ± 0.74 2.75 ± 0.63

3 Methodology

This section discusses the methodology to automatically derive the rhythm
measures by segmenting the speech signal into vowel (V) regions and non-vowel
(NV) regions. Here, the V-regions include the vowels present in the speech
signal, whereas the NV-regions correspond to the consonants and pauses. The
segmentation of V and NV regions is performed using an algorithm proposed
in [23]. This method uses excitation source information, such as the
zero-frequency filtered signal and the Hilbert envelope of the linear prediction
residual, to locate the VOP and VEP in a speech signal [23]. We consider the
speech region from one VOP to the adjacent VEP as a V-region, whereas the region
from one VEP to the next VOP is considered an NV-region. For a more detailed
description, the interested reader can refer to the original paper [23].
The steps to derive the rhythm measures automatically are summarized below.

– Detect the vowel-like regions (VLRs, i.e., V-regions) and non-VLRs
  (NV-regions) in the speech signal.
– Compute the duration of each VLR and non-VLR.
– Compute the rhythm measures and speech rate using the rhythm metric formulas.
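
The second step above can be illustrated with a minimal sketch. It assumes that the detector of [23] has already produced paired, time-sorted lists of VOP and VEP time-stamps in seconds; the function name `intervals_from_vop_vep` and the example time-stamps are hypothetical and are not taken from the original work.

```python
from typing import List, Tuple


def intervals_from_vop_vep(
    vops: List[float], veps: List[float]
) -> Tuple[List[float], List[float]]:
    """Return (V durations, NV durations) in seconds.

    Assumption: vops[i] and veps[i] are the onset and offset of the i-th
    vowel-like region, already paired and sorted in time.
    """
    if len(vops) != len(veps):
        raise ValueError("each VOP needs a matching VEP")
    v_durations: List[float] = []
    nv_durations: List[float] = []
    for i, (onset, offset) in enumerate(zip(vops, veps)):
        if offset <= onset:
            raise ValueError("a VEP must follow its VOP")
        # V-region: from one VOP to the adjacent VEP.
        v_durations.append(offset - onset)
        if i + 1 < len(vops):
            next_onset = vops[i + 1]
            if next_onset < offset:
                raise ValueError("regions must not overlap")
            # NV-region: from this VEP to the next VOP.
            nv_durations.append(next_onset - offset)
    return v_durations, nv_durations


# Hypothetical time-stamps for a short utterance with three vowel regions.
vops = [0.10, 0.45, 0.90]
veps = [0.25, 0.60, 1.10]
v_durations, nv_durations = intervals_from_vop_vep(vops, veps)
```

In this sketch, silence before the first VOP and after the last VEP is ignored; whether such edge regions should count as NV-regions is a design choice that the text does not specify.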

To analyze rhythm, five interval measures, namely %V, ∆V, ∆NV, Varco-V,
and Varco-NV, are calculated. %V is the percentage of the utterance duration
occupied by V-regions. ∆V is the standard deviation of the duration of
V-regions, and ∆NV is the standard deviation of the duration of NV-regions [25].
Varco-NV is defined as the standard deviation of the NV interval duration (∆NV)
expressed as a percentage of the mean duration of the NV-regions. Similarly,
Varco-V is calculated from the V-regions. Two Pairwise Variability Index
measures, nPVI-V and nPVI-NV, are evaluated following [19]. nPVI-V is the
rate-normalized measure of the durational variation between consecutive
V-regions, and nPVI-NV is the rate-normalized measure of the durational
variation between consecutive NV-regions.
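
The measures defined above can be sketched in a few lines, given lists of V and NV durations in seconds. Two points are assumptions of this sketch rather than statements from the paper: the population standard deviation is used for the ∆ measures, and speech rate is taken as the number of V-regions per second, since the text does not give its definition. All function names and the example durations are hypothetical.

```python
from statistics import mean, pstdev


def percent_v(v, nv):
    """%V: share of the total utterance duration occupied by V-regions."""
    return 100.0 * sum(v) / (sum(v) + sum(nv))


def delta(durations):
    """Delta measure: standard deviation of the interval durations.

    Assumption: population standard deviation.
    """
    return pstdev(durations)


def varco(durations):
    """Varco: standard deviation as a percentage of the mean duration."""
    return 100.0 * delta(durations) / mean(durations)


def npvi(durations):
    """nPVI: mean normalized difference between consecutive intervals, x100."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)


def speech_rate(v, nv):
    """Assumed definition: number of V-regions per second of speech."""
    return len(v) / (sum(v) + sum(nv))


# Hypothetical interval durations in seconds.
v = [0.1, 0.2, 0.1, 0.2]
nv = [0.2, 0.4, 0.2]
measures = {
    "%V": percent_v(v, nv),
    "dV": delta(v),
    "dNV": delta(nv),
    "Varco-V": varco(v),
    "Varco-NV": varco(nv),
    "nPVI-V": npvi(v),
    "nPVI-NV": npvi(nv),
    "SR": speech_rate(v, nv),
}
```

Applied to the same durations, the same `delta`, `varco` and `npvi` functions serve both the V and NV measures, which is why only the input list changes between, for example, Varco-V and Varco-NV.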

4 Experiments and Results


4.1 Rhythm Metrics
Figure 1 shows the box plots of the seven rhythm measures and SR for the two
dialects of Mising and for Assamese. Table 2 shows the values of the rhythm
measures calculated from spontaneous speech for Assamese (A), Pagro (P), and
Delu (D). %V and ∆NV are directly related to syllabic structure. A higher ∆NV
means a greater variability in the number of consonants, indicating a language
that may instantiate more syllable types. From the table, we can see that
Mising, having more syllable types, shows lower %V and higher ∆NV. The higher
value of ∆V for Mising can be attributed to a combination of several
phonological factors present in the language: contrastive vowel length, vowel
lengthening in specific contexts, and long vowels all influence vocalic
interval variability. Languages like Spanish do not show such phonological
phenomena, and hence show a lower value of ∆V [5]. Stress-timed languages
typically have a higher standard deviation of consonantal intervals (∆NV) and a
lower percentage of time over which speech is vocalic (%V) than syllable-timed
languages. We have plotted the %V and nPVI-V values obtained from the Assamese
speakers and the two varieties of Mising speakers. In order to compare them
with prototypical syllable-timed, stress-timed, and mora-timed languages, we
have also plotted the measures for British English, Spanish, and Japanese,
obtained from previous observations [11,12,29]. Figure 2 shows the distribution
of languages on the (%V, nPVI-V) plane, supporting the notion of stress-,
syllable-, and mora-timed languages. As observed from the figure, Assamese is
clustered close to Japanese. This result is in agreement with a previous study,
in which the authors used manual annotation-based computation for five Assamese
varieties [7]. Pagro and Delu are placed closer to British English. From this
plot, stress-timing is found to be dominant in the two varieties of Mising. It
is believed that in stress-timed languages, codas and consonant clusters
contribute to a greater consonantal portion of the signal [26]. In Mising, seven
long and seven short vowels are present, which is a marker of rhythm [28].
Additionally, diphthongization is more robust on long vowels [20], and the
presence of long vowels in a language is connected with greater durational
variability. Greater durational variability is reported in stress-timed
languages due to vowel reduction, which is measured by the metrics PVI, ∆V, and
Varco-V. Mora-timed languages generally exhibit a simpler syllable structure.
Researchers have paid significant attention to the possibility of an
interaction between suprasegmental rhythm and vocabulary systems. Many works
have concentrated on criteria of segmental phonology to link to the rhythm
class hypothesis [26].

Fig. 2. Distribution of languages over the (%V, nPVI-V) plane.

Fig. 3. Distribution of languages over the (Varco-V, nPVI-V) plane.
In the process of capturing the hypothesized rhythm class on the language
continuum, the rate-normalized vowel-based metrics Varco-V and nPVI-V are found
to be the most reliable. Hence, Varco-V values are plotted on the vertical axis
against nPVI-V values on the horizontal axis, as shown in Fig. 3. The PVI
profiles provide acoustic evidence for rhythmic differences between English and
Mising (Pagro, Delu) on the one hand, and Spanish on the other. Mora-timed
Japanese and Assamese pattern between the stress-timed and syllable-timed
languages. Stress-timed languages are said to exhibit more vocalic variability
(high nPVI-V) than syllable-timed languages, related to vowel quality.
The findings of the present method seem to be consistent with the previous
method reported in [7]. In Fig. 4, RM values computed using the proposed
automatic method are compared with those of the Praat-based manual annotation
method reported in [7]. The proposed automatic method is applied to spontaneous
speech recorded from speakers from Upper Assam (Proposed (UA-Spon)), whereas
Praat-based syllable-level manual annotation (Manual (JOR-Read) and Manual
(TIN-Read)) is applied to read speech recorded from speakers from Jorhat and
Tinsukia, two major districts of the Upper Assam region [7]. We have considered
RMs obtained from vowel duration statistics, namely %V, Varco-V, nPVI-V, and
∆V, for comparison. The plots show that the RM values are comparable between
the two approaches.

Fig. 4. Comparison of rhythm measure (RM) values computed using the proposed
automatic method for spontaneous speech recorded from speakers from Upper Assam
(Proposed (UA-Spon)) and Praat-based manual annotation (Manual (JOR-Read) and
Manual (TIN-Read)) reported in [7] for read speech recorded from speakers from
Jorhat and Tinsukia, two districts of the Upper Assam region.

4.2 Statistical Analysis


To find whether the computed rhythm measures and speech rate differ across
the two varieties of Mising and Assamese, we perform a statistical analysis
based on Linear Mixed Effects (LME) models [2]. In the LME model, each rhythm
measure is the dependent variable, whereas gender and language are the fixed
factors. Speaker is the random factor in our model. Finally, we perform the
Wald χ2 test to examine the effects of language and gender on the rhythm
measures. Table 3 summarizes the results of the LME-based statistical tests for
Mising vs. Assamese. The table shows that, except for %V, all the rhythm
measures and speech rate show significant p-values (<0.001) for language.
However, gender has no significant effect on the rhythm measures at this level.
Also, we have found no significant difference between the rhythm and speech
rate values of the Mising varieties considered.

Table 3. Wald χ2 tests on LME models for rhythm measures (RM) and speech rate
(SR) for Assamese and Mising.

Measure    Fixed factor     χ2     p-value
%V         Language         0.43   0.51
           Gender           0.3    0.57
           Language:Gender  1.12   0.28
∆V         Language         33.23  8.1E-09
           Gender           3.87   0.04
           Language:Gender  3.64   0.05
∆NV        Language         14.19  0.000165
           Gender           1.58   0.2
           Language:Gender  1.83   0.17
Varco-V    Language         11.58  0.00066
           Gender           0.31   0.57
           Language:Gender  4.21   0.04
Varco-NV   Language         17.01  0.0000371
           Gender           1.79   0.18
           Language:Gender  0.4    0.52
nPVI-V     Language         45.58  1.46E-11
           Gender           1.53   0.21
           Language:Gender  3.51   0.06
nPVI-NV    Language         22.2   0.00000245
           Gender           0.07   0.79
           Language:Gender  0.69   0.4
SR         Language         22.28  0.00000234
           Gender           0.67   0.41
           Language:Gender  2.57   0.1
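
As a reading aid for Table 3, the relation between a Wald χ2 statistic and its p-value can be sketched as follows. The sketch assumes one degree of freedom for the language factor (Mising vs. Assamese), which the text does not state explicitly; the function name is hypothetical. For one degree of freedom, the upper-tail probability of χ2 reduces to the complementary error function, so only the standard library is needed.

```python
import math


def wald_p_value_1df(chi2: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi2 / 2.0))


# Language rows of Table 3 for %V and for the vocalic standard deviation.
p_percent_v = wald_p_value_1df(0.43)
p_delta_v = wald_p_value_1df(33.23)
```

Under this assumption, the two language rows checked here come out close to the values listed in the table: about 0.51 for %V and on the order of 1E-08 for ∆V.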

4.3 Automatic Language Identification Using Speech Rhythm Features and Speech Rate

This work explores two machine learning models, viz. SVM and RF to investigate
the effectiveness of automatically computed Rhythm measures and speech rate
for classifying Mising (Pagro and Delu combined) and Assamese language. The
SVM maps the input features into high dimensional space so that features can
be linearly separable [4]. RF is an ensemble of decision trees, and its prediction
for input will be the class voted by most trees [13].
The models are trained using 4-fold cross-validation in a speaker-independent
manner. In each iteration, speech data from three folds (approx.
80% of total data) are used for training, and the remaining fold is used
to evaluate the performance. The training set is further divided into the actual
training set, which is used to train the model, and a development set, which is
used to optimize the model’s hyperparameters. Thus, the models are evaluated
four times, and for each fold, accuracy and F1-score on the test set are used
as evaluation metrics. We use the grid-search method to tune the hyperparameters
of the SVM, such as C and γ, and of the RF, such as the number of trees. The
results are reported in terms of the mean (µ) and standard deviation (σ) of the
accuracy and F1-score over the four folds. The speakers in the training set are
not included in the test set in any fold. Hence, the classification results are
reported in a speaker-independent manner. The classification results of the
4-fold cross-validation are given in Table 4. The models are developed for three
different combinations of features: RM, RM and SR, and RM and SR excluding %V.
An average accuracy of 81.48% and an average F1-score of 81.08% are observed for
the RM-based SVM model, and similar performance can be seen for the RF
classifier. Inclusion of SR with the RM provides around a 1% improvement
for the SVM. Excluding the %V feature further improves the SVM-based system
performance. The table shows that RM and SR computed automatically using VOP
and VEP detection can be utilized in classifying Mising and Assamese. Figure 5
shows the contribution of each feature in classifying Mising and Assamese.
Rhythm measures related to the vowel regions (except %V) are very important in
the classification. Further investigation needs to be carried out to identify
the possible cause of this trend.

Fig. 5. Feature importance computed from RF using the 4-fold cross-validation.

Table 4. Classification results of the 4-fold cross-validation for Assamese and
Mising. Accuracy and F1-score are given as µ ± σ.

                            SVM (RBF kernel)              RF
Features             Dim.   Accuracy       F1-score       Accuracy       F1-score
RM                   7      81.48 ± 7.30   81.08 ± 8.90   81.12 ± 7.17   80.94 ± 9.43
RM + SR              8      82.53 ± 7.62   81.96 ± 9.24   81.73 ± 7.70   81.27 ± 9.23
RM + SR (except %V)  7      83.10 ± 7.85   82.34 ± 9.21   81.55 ± 7.64   81.14 ± 9.95
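
The speaker-independent splitting described in this section can be sketched as follows. This is an illustrative round-robin assignment of speakers to folds, not the authors' exact procedure, which the text does not detail; the function name and the speaker labels are hypothetical. The only property the sketch guarantees is the one stated above: no speaker contributes tokens to both the training and the test side of a fold.

```python
from typing import Iterator, List, Sequence, Tuple


def speaker_independent_folds(
    speaker_ids: Sequence[str], n_folds: int = 4
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) so that no speaker is on both sides."""
    speakers = sorted(set(speaker_ids))
    if len(speakers) < n_folds:
        raise ValueError("need at least one speaker per fold")
    # Round-robin assignment of whole speakers to folds.
    fold_of = {spk: i % n_folds for i, spk in enumerate(speakers)}
    for fold in range(n_folds):
        test_idx = [i for i, s in enumerate(speaker_ids) if fold_of[s] == fold]
        train_idx = [i for i, s in enumerate(speaker_ids) if fold_of[s] != fold]
        yield train_idx, test_idx


# Hypothetical token-level speaker labels (one entry per utterance token).
speaker_ids = ["A1", "A1", "A2", "P1", "P1", "P2", "D1", "D2", "A2", "D1"]
folds = list(speaker_independent_folds(speaker_ids, n_folds=4))
```

Because whole speakers are assigned to folds, the folds generally differ in token count, and this simple assignment does not balance the two language classes across folds. The `GroupKFold` splitter in scikit-learn provides grouped splitting of the same kind and could be combined with a grid search over the SVM and RF hyperparameters.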

5 Conclusion and Future Directions

This paper discusses a methodology for computing rhythm measures and
speech rate by automatically locating VOPs and VEPs in spontaneous speech.
This automated method relies on acoustic information alone, which is
beneficial compared to labour-intensive manual annotation and forced alignment
methods. We have analyzed the rhythm measures of Mising, a low-resource
language spoken in Assam, and performed a comparative study with Assamese, the
official language of Assam. From the analysis, it is found that Mising is more
stress-timed on the language continuum, while Assamese falls in the mora-timed
language category, in agreement with a previous study [7]. A significant
difference is observed between Mising and Assamese for all the measures
except %V. However, between the two dialects of Mising, no statistically
significant difference is noted, which can also be seen from the boxplots of
the eight features for Pagro and Delu. LID systems are designed using machine
learning models such as SVM and RF, considering combinations of rhythm measures
and speech rate. The SVM-based system with the 7-dimensional feature set (RM
and SR, excluding %V) provides the best accuracy of 83.10% and F1-score of
82.34%.
We have observed notably large values of Varco-NV for Assamese
and the two Mising varieties. The value of Varco-NV for Assamese is found to
be high compared to that reported in [7]. One possible reason is the inclusion
of silence regions in the NV regions. Future work is planned to investigate
this measure in more detail. Moreover, the current work considers only two
Mising dialects; hence, rhythm analysis of other Mising dialects is also planned
in future research to investigate between- and within-speaker rhythmic
variability.

References
1. Abercrombie, D.: Elements of General Phonetics. Edinburgh University Press,
Edinburgh, Scotland (1980)
2. Bates, D., Mächler, M., Bolker, B., Walker, S.: Fitting linear mixed-effects models
using lme4. J. Stat. Softw. 67(1), 1–48 (2015)
3. Boersma, P., Weenink, D.: Praat: doing phonetics by computer (version 5.1.13)
(2009). http://www.praat.org
4. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297
(1995)
5. Dauer, R.M.: Stress-timing and syllable-timing reanalyzed. J. Phonet. 11(1), 51–62
(1983)
6. Dellwo, V., Wagner, P., Solé, M., Recasens, D., Romero, J.: Relations between
language rhythm and speech rate (2003)
7. Dihingia, L., Sarmah, P.: Rhythm and speaking rate in Assamese varieties. In:
Proceedings of 10th International Conference on Speech Prosody 2020,
pp. 561–565 (2020)
8. Farinas, J., Pellegrino, F.: Automatic rhythm modeling for language identifica-
tion. In: Seventh European Conference on Speech Communication and Technology
(2001)
9. Ferragne, E., Pellegrino, F.: Rhythm in read British English: interdialect variability.
In: Eighth International Conference on Spoken Language Processing (2004)
10. Galves, A., Garcia, J., Duarte, D., Galves, C.: Sonority as a basis for rhythmic
class discrimination. In: Speech Prosody 2002, International Conference (2002)
11. Grabe, E., Low, E.L.: Durational Variability in Speech and the Rhythm Class
Hypothesis, pp. 515–546. De Gruyter Mouton (2008)
12. Grenon, I., White, L.: Acquiring rhythm: a comparison of L1 and L2 speakers of
Canadian English and Japanese. In: Proceedings of the 32nd Boston University
Conference on Language Development, pp. 155–166. Citeseer (2008)
13. Ho, T.K.: Random decision forests. In: Proceedings of 3rd International Conference
on Document Analysis and Recognition, vol. 1, pp. 278–282. IEEE (1995)
14. Kim, H., Park, J.S.: Automatic language identification using speech rhythm
features for multi-lingual speech recognition. Appl. Sci. 10(7) (2020).
https://doi.org/10.3390/app10072225, https://www.mdpi.com/2076-3417/10/7/2225
15. Lee, C.S., Todd, N.P.M.: Towards an auditory account of speech rhythm:
application of a model of the auditory ‘primal sketch’ to two multi-language
corpora. Cognition 93(3), 225–254 (2004)
16. Low, E.L., Grabe, E., Nolan, F.: Quantitative characterizations of speech rhythm:
syllable-timing in Singapore English. Lang. Speech 43(4), 377–401 (2000).
https://doi.org/10.1177/00238309000430040301, PMID: 11419223
17. Loukina, A., Kochanski, G., Rosner, B., Keane, E., Shih, C.: Rhythm measures and
dimensions of durational variation in speech. J. Acoust. Soc. Am. 129(5), 3258–3270
(2011). https://doi.org/10.1121/1.3559709
18. Murty, L., Otake, T., Cutler, A.: Perceptual tests of rhythmic similarity: I. mora
rhythm. Lang. Speech 50(1), 77–99 (2007)
19. Nolan, F., Asu, E.L.: The pairwise variability index and coexisting rhythms in
language. Phonetica 66(1–2), 64–77 (2009)
20. Pegu, J.: Morpho-syntactic variation in the Pagro and Sa:jan dialects of the Mising
community. North East Indian Linguist. 3, 155–170 (2011)
21. Pike, K.: The Intonation of American English. University of Michigan Press, Ann
Arbor, MI, USA (1945)
22. Port, R.F., Dalby, J., O’Dell, M.: Evidence for mora timing in Japanese. J. Acoust.
Soc. Am. 81(5), 1574–1585 (1987)
23. Pradhan, G., Prasanna, S.R.M.: Speaker verification by vowel and nonvowel like
segmentation. IEEE Trans. Audio Speech Lang. Process. 21(4), 854–867 (2013)
24. Prasad, B.: Mising grammar. Mysore, Central Institute of Indian languages (CIIL)
Eds: Sastry and Abraham (1991)
25. Ramus, F., Nespor, M., Mehler, J.: Correlates of linguistic rhythm in the speech
signal. Cognition 75(1), AD3-AD30 (2000)
26. Rathcke, T.V., Smith, R.H.: Speech timing and linguistic rhythm: on the acoustic
bases of rhythm typologies. J. Acoust. Soc. Am. 137(5), 2834–2845 (2015).
https://doi.org/10.1121/1.4919322
27. Rouas, J.L., Farinas, J., Pellegrino, F., André-Obrecht, R.: Rhythmic unit extrac-
tion and modelling for automatic language identification. Speech Commun. 47(4),
436–456 (2005)
28. Taid, T.: A short note on Mising phonology. Linguistics of the Tibeto-Burman Area
10.1 (1987)
29. White, L., Mattys, S.: Rhythmic Typology and Variation in First and Second
Languages, pp. 237–257 (2007). https://doi.org/10.1075/cilt.282.16whi
30. Wiget, L., White, L., Schuppler, B., Grenon, I., Rauch, O., Mattys, S.L.: How
stable are acoustic metrics of contrastive speech rhythm? J. Acoust. Soc. Am.
127(3), 1559–1569 (2010)
