
A FINE-TUNED WAV2VEC 2.0/HUBERT BENCHMARK FOR SPEECH EMOTION RECOGNITION, SPEAKER VERIFICATION AND SPOKEN LANGUAGE UNDERSTANDING

Yingzhi Wang, Abdelmoumene Boumadane, Abdelwahab Heba

Zaion Lab, Zaion, Paris, France

arXiv:2111.02735v3 [cs.CL] 3 Oct 2022

ABSTRACT

Speech self-supervised models such as wav2vec 2.0 and HuBERT are making revolutionary progress in Automatic Speech Recognition (ASR). However, they have not been fully proven to produce better performance on tasks other than ASR. In this work, we explored partial fine-tuning and entire fine-tuning of wav2vec 2.0 and HuBERT pre-trained models for three non-ASR speech tasks: Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding. With simple proposed downstream frameworks, the best scores reached 79.58% weighted accuracy in the speaker-dependent setting and 73.01% weighted accuracy in the speaker-independent setting for Speech Emotion Recognition on IEMOCAP, 2.36% equal error rate for Speaker Verification on VoxCeleb1, and 89.38% accuracy for Intent Classification and 78.92% F1 for Slot Filling on SLURP, showing the strength of fine-tuned wav2vec 2.0 and HuBERT in learning prosodic, voice-print and semantic representations.

Index Terms— wav2vec 2.0, HuBERT, speech emotion recognition, speaker verification, spoken language understanding

1. INTRODUCTION

Nowadays, people expect to train well-generalized models for supervised tasks with less labeled data, since data labeling is a very time- and money-consuming process. Furthermore, people have been attempting to find a powerful feature embedding that can assist fine-tuning and multi-task training for downstream tasks. Self-supervised learning meets exactly these two requirements. In the speech domain, excellent self-supervised models are emerging [1, 2, 3, 4, 5, 6, 7, 8, 9], among which the most high-performing and most widely used are wav2vec 2.0 [10] and HuBERT [11]. Many wav2vec 2.0/HuBERT pre-trained models have also been published, which has greatly promoted their application in the field of speech. We therefore chose wav2vec 2.0 and HuBERT as the research objects of this work.

The wav2vec 2.0 model architecture contains mainly three modules. A convolutional neural network (CNN) feature encoder encodes the raw waveform inputs into latent speech representations. Mask operations are applied before they are fed to the Transformer-based contextualized encoder. A quantization module quantizes the latent speech representations from the CNN encoder into a discretized embedding which is then used as the target. The objective is to optimize a contrastive loss, which enforces the model to identify the true quantized latent speech representations.

HuBERT shares the same architecture as wav2vec 2.0. Instead of constructing a contrastive loss, HuBERT uses an offline clustering step to generate noisy labels for Masked Language Model pre-training. Specifically, HuBERT consumes masked continuous speech features to predict predetermined cluster assignments. The predictive loss is applied over the masked regions, forcing the model to learn good high-level representations of unmasked inputs in order to infer the targets of masked ones correctly.

Wav2vec 2.0 and HuBERT outperformed all existing ASR models at the time, proving that they can construct a better verbal embedding. However, speech also contains other important information such as emotion, speaker identity and semantics, for which the industry also has high expectations. In the fields of Speech Emotion Recognition (SER), Speaker Verification (SV) and Spoken Language Understanding (SLU), it remains unclear whether self-supervised models can produce better performance than traditional supervised models (spectral features + CNN-based feature extraction + RNN/Transformer-based time series modeling) [12, 13, 14, 15, 16]. Nevertheless, meaningful attempts have been made in previous works, which we introduce below.

In SUPERB [17], the performance of different frozen self-supervised encoders was benchmarked across a wide range of speech tasks. For SER, the HuBERT large model stood out from other self-supervised encoders with 67.62% accuracy (ACC) on IEMOCAP [18]. For SV, the HuBERT base model obtained the best Equal Error Rate (EER) of 5.11% on VoxCeleb1 [19]. SLU contains two separate subtasks: Intent Classification (IC) and Slot Filling (SF). The HuBERT large model achieved the best results on both IC and SF, with 98.76% ACC on the Fluent Speech Commands dataset [20] and an 89.81% F1 score on SNIPS [21] respectively.

For SER, [22] combined the features from frozen wav2vec 2.0 with other hand-crafted prosodic features and fed them into a 1d-CNN for deeper extraction. [23] explored wav2vec fine-tuning strategies and achieved 65.4% WA on IEMOCAP. For SV, [24, 25] both explored fine-tuned wav2vec 2.0: [24] obtained 3.61% EER on VoxCeleb1, while [25] obtained 1.91% EER on VoxCeleb1 by adding VoxCeleb2 to the training set.

We notice that self-supervised models were only used as frozen feature extractors in SUPERB and some other works. Believing that only by fine-tuning can we show the real power of self-supervised models, we explored the fine-tuning of wav2vec 2.0/HuBERT on three speech tasks and provide full fine-tuning experiment details. Taking inspiration from [10] and [11], we added another fine-tuning method by splitting a pre-trained wav2vec 2.0/HuBERT model into two parts: the CNN feature encoder and the Transformer contextualized encoder. We froze the CNN feature encoder and only fine-tuned the Transformer contextualized encoder. We then tested partially fine-tuned wav2vec 2.0/HuBERT pre-trained models together with the entirely fine-tuned ones on the following tasks:

• Speech Emotion Recognition on IEMOCAP
• Speaker Verification on VoxCeleb1
• Spoken Language Understanding on SLURP [26]

The results show that our fine-tuned models achieved excellent results on all three tasks, which further proves their strong capacity in constructing problem-agnostic representations.
Fig. 1. Partial fine-tuning (left) and entire fine-tuning (right) of wav2vec 2.0/HuBERT.

The code and fine-tuned models for SER and SLU have been open-sourced on SpeechBrain [27] (https://github.com/speechbrain/speechbrain/tree/develop/recipes).

2. METHOD

In this section, we will first introduce the pre-training of the wav2vec 2.0/HuBERT models, then we will show our fine-tuning methods and downstream models for each task.

2.1. Pretrained wav2vec 2.0

The wav2vec 2.0 pre-training is similar to the masked language modelling in BERT [28] and is carried out under a self-supervised setting. Contiguous time steps from the CNN encoder representations are randomly masked, and the model is trained to reproduce the quantized local encoder representations for masked frames at the output of the contextualized encoder.

L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}    (1)

The training objective is illustrated in Eq. 1, where sim(c_t, q_t) is the cosine similarity between the contextualized encoder outputs c_t and the quantized CNN encoder representations q_t, t is the masked time step, Q_t is the union of candidate representations q̃ which includes q_t and K = 100 distractors, and κ is the temperature, which is set to 0.1. The distractors are outputs of the local encoder sampled from masked frames belonging to the same utterance as q_t. The contrastive loss is then given by L_m summed over all masked frames. At the end, an L2 regularization is added to the contrastive loss, as well as a diversity loss to increase the use of the quantized codebook representations.
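To make Eq. 1 concrete, here is a minimal PyTorch sketch of the masked contrastive objective. It assumes the contextualized outputs c_t, the true quantized targets q_t and the K = 100 distractors are already gathered into tensors; the function name and shapes are our own illustration, not the implementation of [10].

import torch
import torch.nn.functional as F

def contrastive_loss(c, q_pos, q_neg, kappa=0.1):
    """Eq. 1 for a batch of masked frames.

    c:     (N, D) contextualized encoder outputs at the masked time steps
    q_pos: (N, D) true quantized CNN-encoder representations q_t
    q_neg: (N, K, D) distractors sampled from other masked frames of the
           same utterance (K = 100 in wav2vec 2.0)
    """
    # Candidate set Q_t = {q_t} union distractors, shape (N, K+1, D)
    candidates = torch.cat([q_pos.unsqueeze(1), q_neg], dim=1)
    # Cosine similarity sim(c_t, q~) scaled by the temperature kappa
    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / kappa
    # The true target q_t sits at index 0, so Eq. 1 becomes a cross-entropy
    # over the (K+1) candidates, summed over all masked frames.
    targets = torch.zeros(c.size(0), dtype=torch.long, device=c.device)
    return F.cross_entropy(logits, targets, reduction="sum")

# Example with random tensors: 8 masked frames, 256-dim, 100 distractors
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 100, 256))

Writing the objective as a cross-entropy over the K + 1 candidates, with the true target at index 0, is equivalent to the softmax form of Eq. 1.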
The pre-training process is optimized with Adam [29] and the learning rate decays linearly after a warm-up. In [10], wav2vec 2.0 is also fine-tuned on ASR aiming to improve ASR performance. For ASR fine-tuning, a randomly initialized linear projection is added to the output of the contextual encoder and the CTC (Connectionist Temporal Classification [30]) loss is minimized. For more details of the pre-training and ASR fine-tuning of wav2vec 2.0, please refer to [10].

In this work, we compare four released wav2vec 2.0 pre-trained models: the wav2vec 2.0 base model (12 transformer blocks and 768 embedding dimension) and its ASR fine-tuned version, and the wav2vec 2.0 large model (24 transformer blocks and 1024 embedding dimension) and its ASR fine-tuned version. Both base and large models are pre-trained on 960h of LibriSpeech [31] data, which is also used for their ASR fine-tuning. ASR fine-tuned models for both wav2vec 2.0 and HuBERT are taken into consideration because we assume that some tasks may benefit from the ASR fine-tuning.

2.2. Pretrained HuBERT

In the same way as wav2vec 2.0, CNN-encoded audio features are randomly masked in HuBERT. To generate labels for the first iteration of HuBERT pre-training, a k-means clustering is applied on 39-dimensional MFCC features. To generate better targets for the subsequent iterations, k-means clustering then works on the latent features extracted from the HuBERT model pre-trained in the previous iteration. A projection layer is added over the transformer blocks to predict cluster labels. The cross-entropy loss is computed over masked timestamps, which can be defined as:

L_m(f; X, \{Z^{(k)}\}_k, M) = \sum_{t \in M} \sum_k \log p_f^{(k)}(z_t^{(k)} \mid \tilde{X}, t)    (2)

M ⊂ [T] denotes the set of indices to be masked for a length-T sequence X, and X̃ = r(X; M) denotes a corrupted version of X where x_t is replaced with a mask embedding x̃ if t ∈ M. A masked prediction model f takes X̃ as input and predicts a distribution over the target indices at each timestep, p_f(· | X̃, t). To improve target quality, cluster ensembles are utilized in case an individual clustering model performs badly; Z^(k) then denotes the target sequence generated by the k-th clustering model.
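As an illustration of Eq. 2, the sketch below computes the masked cross-entropy for a single clustering model, assuming the projected frame logits and the k-means cluster labels are given as tensors; the names and shapes are ours, not those of [11]. Minimizing this cross-entropy maximizes the summed log-likelihood written in Eq. 2.

import torch
import torch.nn.functional as F

def hubert_masked_loss(logits, cluster_labels, mask):
    """Masked-prediction loss of Eq. 2 for one clustering model.

    logits:         (B, T, C) scores over the C cluster codes, produced by the
                    projection layer on top of the Transformer outputs
    cluster_labels: (B, T) k-means cluster assignments z_t used as targets
    mask:           (B, T) boolean, True where the frame was masked (t in M)
    """
    # Keep only masked time steps, since the predictive loss is applied over M
    masked_logits = logits[mask]          # (N_masked, C)
    masked_labels = cluster_labels[mask]  # (N_masked,)
    # Cross-entropy = negative log-probability of the true cluster code
    return F.cross_entropy(masked_logits, masked_labels, reduction="sum")

# With a cluster ensemble {Z^(k)}, the per-clustering losses are simply summed.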
HuBERT pre-training uses the same optimizer and learning rate scheduler as wav2vec 2.0. For ASR fine-tuning, the projection layer is removed and replaced by a randomly initialized softmax layer, then the CTC loss is optimized. For more details of the pre-training of HuBERT, please refer to [11].
Like wav2vec 2.0, we compare three released HuBERT pre-trained models: the HuBERT base model (12 transformer blocks and 768 embedding dimension, of which no ASR fine-tuned version is released), and the HuBERT large model (24 transformer blocks and 1024 embedding dimension) and its ASR fine-tuned version. The HuBERT base model is pre-trained on 960h of LibriSpeech data, while the large model is pre-trained on 60k hours of Libri-Light [32] data. The ASR fine-tuning is also based on 960h of LibriSpeech data.

2.3. Fine-tuning

As shown on the left of Figure 1 for partial fine-tuning, the wav2vec 2.0/HuBERT model is divided into two parts: the CNN-based feature encoder and the transformer-based contextualized encoder. We froze the CNN-based feature encoder, fixing all the parameters of these CNN blocks, and only fine-tuned the parameters of the transformer blocks. Partial fine-tuning can be understood as domain adaptation training for the top level, which aims to prevent interference and damage to the bottom CNN layers that already have an expressive ability.

For entire fine-tuning, shown on the right of Figure 1, the CNN and Transformer modules are both fine-tuned during the downstream training process. By training general features at the bottom level, entire fine-tuning allows higher-level expressions to be more complete and more targeted.
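As a concrete illustration of partial fine-tuning, the sketch below freezes the CNN feature encoder of a pre-trained wav2vec 2.0 model while leaving the Transformer contextualized encoder trainable. The HuggingFace transformers class and checkpoint name are used here only as a stand-in; the recipes released with this paper are built on SpeechBrain [27].

import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Partial fine-tuning: freeze the CNN feature encoder, keep the
# Transformer contextualized encoder trainable.
for p in model.feature_extractor.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")

Entire fine-tuning simply skips the freezing loop so that the optimizer also updates the CNN blocks.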

Fig. 2. Simple downstream models for SER, SID and SLU. For SER and SID, an average time pooling and a linear classifier is built over wav2vec 2.0/HuBERT. For SLU, an attentional decoder decodes intents and slots directly from the fine-tuned wav2vec 2.0/HuBERT embedding.

Then, assuming that fine-tuned wav2vec 2.0/HuBERT are already powerful enough to capture information, we directly added simple downstream adaptors (classifier/decoder) to wav2vec 2.0/HuBERT without adding another heavy and redundant encoder. The downstream adaptors for each task are presented below.

For SER, an average time pooling and one linear layer are added as a simple downstream classifier (Fig. 2). The average time pooling compresses variable time lengths into one, then the linear layer effectuates an utterance-level classification minimizing the cross-entropy loss.
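Assuming a generic (batch, time, dimension) embedding from the fine-tuned encoder and the four IEMOCAP emotion classes, a minimal version of this downstream classifier could look as follows (layer sizes are illustrative).

import torch
import torch.nn as nn

class PoolingClassifier(nn.Module):
    """Average time pooling followed by one linear layer (the SER/SID head of Fig. 2)."""
    def __init__(self, embed_dim=768, num_classes=4):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, hidden_states):             # (batch, time, embed_dim)
        pooled = hidden_states.mean(dim=1)         # average time pooling -> (batch, embed_dim)
        return self.classifier(pooled)             # utterance-level logits

head = PoolingClassifier()
logits = head(torch.randn(2, 120, 768))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))  # cross-entropy over 4 emotions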
For SV, a Speaker Identification (SID) task is first implemented using the same downstream framework as SER. Pairwise cosine-similarity scores are then produced for SV on the pre-trained SID embeddings before the linear classification layer.
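The verification step reduces to a cosine similarity between the two SID embeddings of each trial and an Equal Error Rate computed over all trial scores. The helpers below are a rough sketch with our own names; the EER is found by sweeping a decision threshold until the false-acceptance and false-rejection rates meet.

import numpy as np
import torch.nn.functional as F

def trial_score(emb_a, emb_b):
    """Pairwise cosine similarity between two speaker embeddings (1-D tensors)."""
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()

def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the trial scores.

    scores: cosine similarities; labels: 1 for same-speaker trials, 0 otherwise.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_gap, eer = 1.0, None
    for thr in np.unique(scores):
        far = np.mean(scores[labels == 0] >= thr)   # false acceptance rate
        frr = np.mean(scores[labels == 1] < thr)    # false rejection rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer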
For SLU (Fig. 2), another attentional GRU-based decoder is added to decode semantic information directly from the fine-tuned wav2vec 2.0/HuBERT embedding. In our work, intents and slots are both treated as a sequence-to-sequence ASR task and are both decoded from the attentional decoder. The Negative Log-Likelihood (NLL) loss is then calculated over a character-level token generation. Following the observations of [33], we utilized a beam search with a beam of 80 and without coverage penalty to identify the optimum sequence for the validation and test sets.
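The paper does not spell out the decoder internals, so the following is only a generic sketch of an attentional GRU decoder trained with character-level NLL under teacher forcing; the dimensions, the attention form and the class name are our assumptions. At inference, the greedy loop would be replaced by the beam search of width 80 mentioned above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnGRUDecoder(nn.Module):
    """Content-based attention over encoder states + GRU cell + character softmax."""
    def __init__(self, enc_dim=768, hid_dim=256, vocab_size=60):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.attn = nn.Linear(enc_dim + hid_dim, 1)       # additive-style attention scoring
        self.cell = nn.GRUCell(enc_dim + hid_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, enc, targets):                      # enc: (B, T, enc_dim), targets: (B, L)
        B, T, _ = enc.shape
        h = enc.new_zeros(B, self.cell.hidden_size)
        nll = 0.0
        for i in range(targets.size(1)):
            # attention weights over the T encoder frames given the decoder state
            scores = self.attn(torch.cat([enc, h.unsqueeze(1).expand(-1, T, -1)], dim=-1)).squeeze(-1)
            context = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc).sum(dim=1)
            prev = self.embed(targets[:, i - 1]) if i > 0 else h   # teacher forcing
            h = self.cell(torch.cat([context, prev], dim=-1), h)
            log_probs = F.log_softmax(self.out(h), dim=-1)
            nll = nll + F.nll_loss(log_probs, targets[:, i])       # character-level NLL
        return nll / targets.size(1)

dec = AttnGRUDecoder()
loss = dec(torch.randn(2, 120, 768), torch.randint(0, 60, (2, 16)))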
3. EXPERIMENTS

In this section, we will first introduce the datasets used for the three tasks, then we will list the models we compared and give the details of our experiment settings. Finally, we will show the results for each task and compare them with the existing state-of-the-art baselines.

3.1. Datasets

The three most widely used and most representative datasets were chosen for our experiments: IEMOCAP for SER, VoxCeleb1 for SV and SLURP for SLU.

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset has approximately 12 hours of data and consists of scripted and improvised dialogues by 10 speakers. In order to form a contrast in this work, we used 4 emotional classes as in SUPERB: anger, happiness, sadness and neutral, following the work of [34]. The evaluation metric is weighted accuracy (WA) and the experiments were carried out on two different split settings: the Speaker-Dependent (SD) setting and the Speaker-Independent (SI) setting. For SD, the results were averaged over 5 different random seeds for the train-validation-test split. For SI, a 10-fold cross-validation was performed with a leave-two-speaker-out strategy (one speaker for validation and one for test).
interference and damage to the bottom CNN layers that already have were not used in our experiments. We used equal error rate (EER)
an expressive ability. as the evaluation metric and the results were averaged on 5 different
For entire fine-tuning which is shown on the right in Figure 1, the seeds for train-validation split.
CNN and Transformer modules are both fine-tuned during the down- The Spoken Language Understanding Resource Package (SLURP)
stream training process. By training general features at the bottom Dataset is a collection of 72K audio recordings of single turn user
level, entire fine-tuning allows higher-level expressions to be more interactions with a home assistant, annotated with three levels of se-
complete and more targeted. mantics: Scenario, Action and Entities. The training and evaluation
Then, assuming that fine-tuned wav2vec 2.0/HuBERT are al- are based on its official training, validation and test sets.
ready powerful enough to capture information, we directly added
simple downstream adaptors (classifier/decoder) to wav2vec 2.0/Hu- 3.2. Fine-tuning settings
BERT without adding another heavy and redundant encoder. The
downstream adaptors for each task are presented as below. We rename the models we compare with a method as below.
For SER, an average time pooling and one linear layer are added • EF/PF/Frozen: Entirely Fine-tuned/Partially Fine-tuned/Not
as a simple downstream classifier (Fig.2). The average time pooling fine-tuned
compresses variant time lengths into one, then the linear layer effec- • w2v/hbt: wav2vec 2.0/HuBERT based model
tuates an utterance-level classification minimizing the cross-entropy
loss. • base/large: base/large pre-trained model
For SV, a Speaker Identification (SID) task is first imple- • -/960h: with/without ASR fine-tuning using 960h Lib-
mented using the same downstream framework as SER. Pairwise riSpeech data
For example, EF-w2v-base refers to an entirely fine-tuned wav2vec 2.0 base model, while PF-hbt-large-960h refers to a partially fine-tuned HuBERT large model with ASR fine-tuning. For more detailed parameters of the released pre-trained wav2vec 2.0/HuBERT models, please refer to [10] and [11].

During the fine-tuning process, we applied two different schedulers to separately adjust the fine-tuning learning rate of the wav2vec 2.0/HuBERT encoder and the learning rate of the downstream model. Both schedulers use an Adam optimizer and linearly anneal the learning rates according to the performance on the validation stage. For SER and SV, the initial fine-tuning learning rate and the downstream learning rate are set to 10^-5 and 10^-4. For SLU, these values are set to 10^-5 and 3 × 10^-4.
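This two-learning-rate setup can be sketched with two Adam optimizers, one for the encoder and one for the downstream model. PyTorch's ReduceLROnPlateau is used here merely as a stand-in for the validation-driven linear annealing described above, and the modules are placeholders.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# encoder = fine-tuned wav2vec 2.0/HuBERT, head = downstream classifier/decoder
encoder, head = torch.nn.Linear(10, 10), torch.nn.Linear(10, 4)   # placeholders

enc_opt = Adam(encoder.parameters(), lr=1e-5)    # encoder fine-tuning lr (1e-5 for all tasks)
head_opt = Adam(head.parameters(), lr=1e-4)      # downstream lr (3e-4 for SLU)

# Stand-in for the validation-driven annealing used in the paper
enc_sched = ReduceLROnPlateau(enc_opt, mode="max")
head_sched = ReduceLROnPlateau(head_opt, mode="max")

for epoch in range(10):
    valid_metric = 0.0   # placeholder: validation WA / EER / ACC after this epoch
    enc_sched.step(valid_metric)
    head_sched.step(valid_metric)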
Table 1. Benchmark results for two utterance-level tasks: Speech Emotion Recognition (SER) on weighted accuracy (WA%) and Speaker Verification (SV) on Equal Error Rate (EER%). SER is divided into SER-SD (Speaker-Dependent setting) and SER-SI (Speaker-Independent setting).

Model                    SER-SD   SER-SI   SV
EF-w2v-base              75.90    70.75    2.77
PF-w2v-base              77.02    70.21    3.15
EF-w2v-base-960h         73.64    64.20    4.46
PF-w2v-base-960h         73.84    68.34    4.38
EF-w2v-large             77.00    70.96    3.42
PF-w2v-large             77.47    70.99    3.85
EF-w2v-large-960h        73.00    68.18    4.27
PF-w2v-large-960h        76.75    69.08    4.47
EF-hbt-base              76.53    69.83    2.84
PF-hbt-base              76.60    69.68    3.13
EF-hbt-large             78.52    72.31    2.86
PF-hbt-large             79.58    73.01    3.21
EF-hbt-large-960h        78.78    72.71    2.36
PF-hbt-large-960h        78.96    72.98    2.38
Frozen-w2v-base [17]     -        63.43    6.02
Frozen-w2v-large [17]    -        65.64    5.65
Frozen-hbt-base [17]     -        64.92    5.11
Frozen-hbt-large [17]    -        67.62    5.98
Head Fusion [35]         76.18    -        -
Attention Pooling [36]   -        71.75    -
Siamese Capsule [37]     -        -        3.14

3.3. Results and discussion

3.3.1. Speech Emotion Recognition & Speaker Verification

The results of SER and SV are shown in Table 1. For comparison, we show SUPERB's results as a non-fine-tuned baseline (marked with [17] in Table 1). Furthermore, we took the Head-Fusion ACNN [35] for SER-SD (Speaker-Dependent setting), the Attention Pooling based representation [36] for SER-SI (Speaker-Independent setting) and the Siamese Capsule network [37] for SV as state-of-the-art baselines respectively. Compared with other more recent works, [36] provides a comparable result by reporting a competitive Weighted Accuracy using only speech, and [37] also provides a comparable result using only VoxCeleb1 as the training set (the same as SUPERB). First of all, from an overall perspective, we notice a significant improvement in the results of the fine-tuned models over those of SUPERB's non-fine-tuned models. Then, for SER, we are surprised to find that all of our fine-tuned models performed well: the partially fine-tuned HuBERT large model reached a best WA of 79.58% for SER-SD and a best WA of 73.01% for SER-SI, improving on the state-of-the-art baselines by 3.40% and 1.26% respectively. Moreover, we observe that partial fine-tuning appeared to be a better fine-tuning method than entire fine-tuning. We consider that IEMOCAP is a small dataset with only 12 hours of data, and training too many parameters may easily cause overfitting. Additionally, we noticed that ASR fine-tuning did not help the downstream SER task, suggesting a loss of prosodic information during the ASR fine-tuning.

In the case of SV, the entirely fine-tuned HuBERT with ASR fine-tuning reached a best Equal Error Rate of 2.36%, surpassing the baseline by 0.78%. However, contrary to SER, entire fine-tuning outperforms partial fine-tuning, as can be seen from the results. Due to the large amount of data in VoxCeleb1 (351 hours), which is also acoustically similar to the data used for pre-training, the pre-trained encoder parameters provide an ideal initialization for the downstream SV task; releasing all the layers and fine-tuning with a low learning rate can lead to a good result. Finally, we find that HuBERT turned out to be a better self-supervised encoder than wav2vec 2.0 for both the SER and SV tasks.

Table 2. Benchmark results for Spoken Language Understanding: Intent Classification (IC) on accuracy (ACC%) and Slot Filling (SF) on F1 score (F1%).

Model                IC (ACC%)   SF (F1%)
EF-w2v-base          87.13       74.32
PF-w2v-base          86.58       74.73
EF-w2v-base-960h     85.89       74.33
PF-w2v-base-960h     86.13       73.78
EF-w2v-large         85.80       72.45
PF-w2v-large         86.29       73.16
EF-w2v-large-960h    86.10       73.39
PF-w2v-large-960h    86.35       74.03
EF-hbt-base          87.44       75.06
PF-hbt-base          87.51       75.32
EF-hbt-large         89.38       78.43
PF-hbt-large         89.22       78.92
EF-hbt-large-960h    88.71       78.89
PF-hbt-large-960h    88.32       78.17
Frozen-w2v-base      47.15       37.66
Frozen-w2v-large     3.88        3.85
Frozen-hbt-base      68.74       57.06
Frozen-hbt-large     74.42       60.07
CTI [38]             86.92       74.66

3.3.2. Spoken Language Understanding

For SLU, the results of its two subtasks are shown in Table 2. Likewise, we carried out experiments with frozen wav2vec 2.0/HuBERT models to form a contrast. However, the performance of the frozen models on this task drops significantly; for the frozen wav2vec 2.0 large model in particular, the loss cannot even converge, demonstrating that frozen wav2vec 2.0/HuBERT cannot hold complete semantic information. The Continuous Token Interface (CTI) [38] is chosen as the state-of-the-art baseline. The best ACC for IC is 89.38% with the entirely fine-tuned HuBERT large model, while the partially fine-tuned HuBERT large model reached the best F1 of 78.92% for SF. For SLU, the gap between the two fine-tuning methods is not obvious. A slight drop is observed on the ASR fine-tuned models, which implies that ASR fine-tuning also results in a loss of semantic information.

4. CONCLUSIONS

In this work, we explored different fine-tuning methods on two of the most powerful self-supervised models (wav2vec 2.0 and HuBERT), then benchmarked their performance on Speech Emotion Recognition, Speaker Verification and Spoken Language Understanding tasks. State-of-the-art results were achieved for all three tasks, proving the excellent generalizability of wav2vec 2.0/HuBERT in learning prosodic, voice-print and semantic representations. We hope to show the broad prospects of self-supervised learning and also provide some useful insights for its industrial applications.
5. REFERENCES

[1] Yu-An Chung, Wei-Ning Hsu, Hao Tang, and James Glass, "An unsupervised autoregressive model for speech representation learning," in Interspeech, 2019.

[2] Alexander H. Liu, Yu-An Chung, and James Glass, "Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies," in Proc. Interspeech 2021, 2021, pp. 3730–3734.

[3] Andy T Liu, Shu-wen Yang, Po-Han Chi, Po-chun Hsu, and Hung-yi Lee, "Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6419–6423.

[4] Andy T Liu, Shang-Wen Li, and Hung-yi Lee, "Tera: Self-supervised learning of transformer encoder representation for speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021.

[5] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, "Multi-task self-supervised learning for robust speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6989–6993.

[6] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli, "wav2vec: Unsupervised Pre-Training for Speech Recognition," in Proc. Interspeech 2019, 2019, pp. 3465–3469.

[7] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in International Conference on Learning Representations (ICLR), 2020.

[8] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, 2022.

[9] Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, and Xuedong Huang, "Unispeech: Unified speech representation learning with labeled and unlabeled data," in International Conference on Machine Learning. PMLR, 2021, pp. 10937–10947.

[10] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in NeurIPS, 2020.

[11] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, "HuBERT: Self-supervised speech representation learning by masked prediction of hidden units," arXiv preprint arXiv:2106.07447, 2021.

[12] Pengcheng Li, Yan Song, Ian Vince McLoughlin, Wu Guo, and Lirong Dai, "An attention pooling based representation learning method for speech emotion recognition," in Interspeech. 2018, pp. 3087–3091, ISCA.

[13] Xixin Wu, Songxiang Liu, Yuewen Cao, Xu Li, Jianwei Yu, Dongyang Dai, Xi Ma, Shoukang Hu, Zhiyong Wu, Xunying Liu, et al., "Speech emotion recognition using capsule networks," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6695–6699.

[14] Gautam Bhattacharya, Md Jahangir Alam, and Patrick Kenny, "Deep speaker embeddings for short-duration speaker verification," in Interspeech, 2017, pp. 1517–1521.

[15] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, "ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification," in Interspeech. 2020, pp. 3830–3834, ISCA.

[16] Dmitriy Serdyuk, Yongqiang Wang, Christian Fuegen, Anuj Kumar, Baiyang Liu, and Yoshua Bengio, "Towards end-to-end spoken language understanding," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5754–5758.

[17] Shu-wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Lai, Kushal Lakhotia, Yist Lin, Andy Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, and Hung-yi Lee, "SUPERB: Speech processing universal performance benchmark," in Proc. Interspeech 2021, 2021, pp. 1194–1198.

[18] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.

[19] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, "VoxCeleb: A large-scale speaker identification dataset," in Interspeech. 2017, pp. 2616–2620, ISCA.

[20] Natalia Tomashenko, Antoine Caubrière, Yannick Estève, Antoine Laurent, and Emmanuel Morin, "Recent advances in end-to-end spoken language understanding," in International Conference on Statistical Language and Speech Processing. Springer, 2019, pp. 44–55.

[21] Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al., "Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces," arXiv preprint arXiv:1805.10190, 2018.

[22] Leonardo Pepino, Pablo Riera, and Luciana Ferrer, "Emotion Recognition from Speech Using wav2vec 2.0 Embeddings," in Proc. Interspeech 2021, 2021, pp. 3400–3404.

[23] Yangyang Xia, Li-Wei Chen, Alexander Rudnicky, and Richard M Stern, "Temporal context in speech emotion recognition," in Proc. Interspeech 2021, 2021, pp. 3370–3374.

[24] Zhiyun Fan, Meng Li, Shiyu Zhou, and Bo Xu, "Exploring wav2vec 2.0 on Speaker Verification and Language Identification," in Proc. Interspeech 2021, 2021, pp. 1509–1513.

[25] Nik Vaessen and David A Van Leeuwen, "Fine-tuning wav2vec2 for speaker recognition," in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 7967–7971.

[26] Emanuele Bastianelli, Andrea Vanzo, Pawel Swietojanski, and Verena Rieser, "SLURP: A spoken language understanding resource package," arXiv preprint arXiv:2011.13205, 2020.

[27] Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al., "SpeechBrain: A general-purpose speech toolkit," arXiv preprint arXiv:2106.04624, 2021.

[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL-HLT (1). 2019, pp. 4171–4186, Association for Computational Linguistics.

[29] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in ICLR (Poster), 2015.

[30] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.

[31] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[32] Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al., "Libri-light: A benchmark for ASR with limited or no supervision," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7669–7673.

[33] Loren Lugosch, Piyush Papreja, Mirco Ravanelli, Abdelwahab Heba, and Titouan Parcollet, "Timers and Such: A practical benchmark for spoken language understanding with numbers," NeurIPS Datasets and Benchmarks, 2021.

[34] Haytham M Fayek, Margaret Lech, and Lawrence Cavedon, "Evaluating deep learning architectures for speech emotion recognition," Neural Networks, vol. 92, pp. 60–68, 2017.

[35] Mingke Xu, Fan Zhang, and Wei Zhang, "Head fusion: Improving the accuracy and robustness of speech emotion recognition on the IEMOCAP and RAVDESS dataset," IEEE Access, vol. 9, pp. 74539–74549, 2021.

[36] Pengcheng Li, Yan Song, Ian Vince McLoughlin, Wu Guo, and Li-Rong Dai, "An attention pooling based representation learning method for speech emotion recognition," Interspeech, 2018.

[37] Amirhossein Hajavi and Ali Etemad, "Siamese capsule network for end-to-end speaker recognition in the wild," in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7203–7207.

[38] Seunghyun Seo, Donghyun Kwak, and Bowon Lee, "Integration of pre-trained networks with continuous token interface for end-to-end spoken language understanding," arXiv preprint arXiv:2104.07253, 2021.
