ity on constructing problem-agnostic representations. The code and fine-tuned models for SER and SLU have been open-sourced on SpeechBrain [27]¹.

¹ https://github.com/speechbrain/speechbrain/tree/develop/recipes

2. METHOD

In this section, we first introduce the pre-training of the wav2vec 2.0/HuBERT models, and then present our fine-tuning methods and downstream models for each task.

2.1. Pretrained wav2vec 2.0

The wav2vec 2.0 pre-training is similar to the masked language modelling of BERT [28] and is carried out in a self-supervised setting. Contiguous time steps of the CNN encoder representations are randomly masked, and the model is trained to reproduce the quantized local encoder representations for the masked frames at the output of the contextualized encoder.

\[
L_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)} \tag{1}
\]

The training objective is given in Eq. 1, where sim(c_t, q_t) is the cosine similarity between the contextualized encoder output c_t and the quantized CNN encoder representation q_t, t is a masked time step, Q_t is the set of candidate representations q̃, which includes q_t and K = 100 distractors, and κ is the temperature, set to 0.1. The distractors are outputs of the local encoder sampled from masked frames belonging to the same utterance as q_t. The contrastive loss is then given by L_m summed over all masked frames. Finally, an L2 regularization is added to the contrastive loss, as well as a diversity loss that encourages the use of the quantized codebook representations.
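As a concrete illustration of Eq. 1, the PyTorch sketch below computes the contrastive term for a single masked time step. The function name, tensor shapes, and pre-sampled distractors are assumptions made for illustration; this is not the fairseq implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_term(c_t, q_t, distractors, kappa=0.1):
    """Eq. 1 for one masked time step (illustrative sketch).

    c_t:         contextualized encoder output, shape (dim,)
    q_t:         quantized local encoder target, shape (dim,)
    distractors: K quantized outputs sampled from other masked frames
                 of the same utterance, shape (K, dim)
    kappa:       temperature (0.1 in wav2vec 2.0)
    """
    # Candidate set Q_t = {q_t} ∪ distractors, with the true target at index 0
    candidates = torch.cat([q_t.unsqueeze(0), distractors], dim=0)        # (K+1, dim)
    # Cosine similarities sim(c_t, q̃) scaled by the temperature
    sims = F.cosine_similarity(c_t.unsqueeze(0), candidates, dim=-1) / kappa
    # Negative log softmax probability assigned to the true target
    return -F.log_softmax(sims, dim=0)[0]

# Toy usage; the full pre-training loss sums this term over all masked frames.
dim, K = 768, 100
loss = contrastive_term(torch.randn(dim), torch.randn(dim), torch.randn(K, dim))
```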
The pre-training process is optimized with Adam [29], and the learning rate decays linearly after a warm-up. In [10], wav2vec 2.0 is also fine-tuned for ASR, aiming to improve ASR performance. For ASR fine-tuning, a randomly initialized linear projection is added to the output of the contextual encoder and the CTC (Connectionist Temporal Classification [30]) loss is minimized. For more details of the pre-training and ASR fine-tuning of wav2vec 2.0, please refer to [10].
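As a sketch of this fine-tuning head, the snippet below wires a randomly initialized linear projection and a CTC loss on top of contextual encoder outputs. The vocabulary size, blank index, and tensor shapes are hypothetical placeholders, not the fairseq or SpeechBrain recipe.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: base-model embeddings (768) projected onto a small
# character vocabulary (32 tokens, with the CTC blank at index 0).
embed_dim, vocab_size = 768, 32
projection = nn.Linear(embed_dim, vocab_size)        # randomly initialized head
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Placeholder contextual encoder outputs: (time, batch, embed_dim)
encoder_out = torch.randn(200, 4, embed_dim)
log_probs = projection(encoder_out).log_softmax(dim=-1)      # (T, N, vocab)

# Padded character targets and the true input/target lengths
targets = torch.randint(1, vocab_size, (4, 50))
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 50, dtype=torch.long)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
```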
In this work, we compare four released wav2vec 2.0 pre-trained models: the wav2vec 2.0 base model (12 transformer blocks, 768-dimensional embeddings) and its ASR fine-tuned version, and the wav2vec 2.0 large model (24 transformer blocks, 1024-dimensional embeddings) and its ASR fine-tuned version. Both the base and large models are pre-trained on 960h of LibriSpeech [31] data, which is also used for their ASR fine-tuning. ASR fine-tuned models for both wav2vec 2.0 and HuBERT are taken into consideration because we assume that some tasks may benefit from ASR fine-tuning.
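For reference, comparable checkpoints can be loaded through the Hugging Face Transformers mirrors of the fairseq releases; the identifiers below are assumptions for illustration and are not necessarily the exact artifacts used in this work.

```python
# Illustrative loading of the four wav2vec 2.0 variants (checkpoint names assumed).
from transformers import Wav2Vec2Model

checkpoints = {
    "base":      "facebook/wav2vec2-base",        # 12 blocks, 768-dim embeddings
    "base-asr":  "facebook/wav2vec2-base-960h",   # + ASR fine-tuning on 960h
    "large":     "facebook/wav2vec2-large",       # 24 blocks, 1024-dim embeddings
    "large-asr": "facebook/wav2vec2-large-960h",  # + ASR fine-tuning on 960h
}
models = {name: Wav2Vec2Model.from_pretrained(ckpt)
          for name, ckpt in checkpoints.items()}
```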
2.2. Pretrained HuBERT

In the same way as wav2vec 2.0, CNN-encoded audio features are randomly masked in HuBERT. To generate labels for the first iteration of HuBERT pre-training, k-means clustering is applied to 39-dimensional MFCC features. To generate better targets for the subsequent iterations, k-means clustering is then applied to the latent features extracted from the HuBERT model pre-trained in the previous iteration. A projection layer is added over the transformer blocks to predict the cluster labels. A cross-entropy loss is computed over the masked time steps, which can be defined as:

\[
L_m(f; X, \{Z^{(k)}\}_k, M) = \sum_{t \in M} \sum_{k} \log p_f^{(k)}\big(z_t^{(k)} \mid \tilde{X}, t\big) \tag{2}
\]

Here, M ⊂ [T] denotes the set of indices to be masked for a length-T sequence X, and X̃ = r(X; M) denotes a corrupted version of X in which x_t is replaced with a mask embedding x̃ if t ∈ M. A masked prediction model f takes X̃ as input and predicts a distribution over the target indices at each time step, p_f(·|X̃, t). To improve target quality, cluster ensembles are utilized in case an individual clustering model performs badly; Z^(k) then denotes the target sequence generated by the k-th clustering model.
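The masked cross-entropy of Eq. 2 can be sketched as follows for a single clustering model (k = 1); the shapes and number of clusters are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def hubert_masked_loss(logits, targets, mask):
    """Cross-entropy over masked time steps (single-ensemble sketch of Eq. 2).

    logits:  (T, num_clusters) projection-layer outputs over cluster labels
    targets: (T,) k-means cluster assignments z_t
    mask:    (T,) boolean, True for the masked time steps in M
    """
    # Negative log-likelihood of the cluster labels, summed over t in M
    return F.cross_entropy(logits[mask], targets[mask], reduction="sum")

# Toy usage with 100 clusters (e.g. k-means on MFCC features) and T = 200 frames.
T, C = 200, 100
loss = hubert_masked_loss(torch.randn(T, C),
                          torch.randint(0, C, (T,)),
                          torch.rand(T) < 0.5)
```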
HuBERT pre-training uses the same optimizer and learning rate scheduler as wav2vec 2.0. For ASR fine-tuning, the projection layer is removed and replaced by a randomly initialized softmax layer, and then the CTC loss is optimized. For more details of the pre-training of HuBERT, please refer to [11].
cosine-similarity scores are then produced for SV on the pre-trained SID embeddings before the linear classification layer.
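As an illustration of this scoring step, the sketch below produces a cosine-similarity score between two hypothetical SID embeddings taken before the linear classification layer; the embedding size and decision threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def verification_score(emb_enroll, emb_test):
    # Cosine similarity between enrollment and test embeddings
    return F.cosine_similarity(emb_enroll, emb_test, dim=-1)

score = verification_score(torch.randn(256), torch.randn(256))
accept = score > 0.5   # the threshold would be tuned on a development set
```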
For SLU (Fig. 2), another attentional GRU-based decoder is added to decode semantic information directly from the fine-tuned wav2vec 2.0/HuBERT embeddings. In our work, intents and slots are both treated as a sequence-to-sequence ASR task and are decoded by the attentional decoder. The Negative Log-Likelihood (NLL) loss is then calculated over character-level token generation. Following the observations of [33], we used a beam search with a beam size of 80 and no coverage penalty to identify the optimal sequence for the validation and test sets.
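To make this training objective concrete, the sketch below computes a character-level NLL loss over hypothetical decoder outputs; the names, shapes, and padding convention are assumptions and do not reproduce the SpeechBrain recipe.

```python
import torch
import torch.nn.functional as F

def slu_nll_loss(decoder_logits, target_chars, pad_id=0):
    """NLL over character-level token generation.

    decoder_logits: (batch, steps, vocab) outputs of the attentional decoder
    target_chars:   (batch, steps) gold character indices (pad_id where padded)
    """
    log_probs = F.log_softmax(decoder_logits, dim=-1)
    # nll_loss expects (batch, vocab, steps); padded positions are ignored
    return F.nll_loss(log_probs.transpose(1, 2), target_chars, ignore_index=pad_id)

# Toy usage: intents and slots are emitted as one character sequence and scored jointly.
B, S, V = 4, 60, 64
loss = slu_nll_loss(torch.randn(B, S, V), torch.randint(1, V, (B, S)))
# At inference, a beam search (beam size 80, no coverage penalty) selects the output.
```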
3. EXPERIMENTS
4. CONCLUSIONS