Self-Attention For Audio Super-Resolution
arXiv:2108.11637v1 [cs.SD] 26 Aug 2021
Figure 2. Our implementation uses K = 4 and a Transformer block composed of a stack of 4 layers, with h = 8 attention heads and hidden dimension d = 2048.

4.2. Training details

Our model is trained for 50 epochs on patches of length 8192, as are existing audio super-resolution models. This ensures a fair comparison. The low-resolution audio signals are first processed with bicubic upscaling before they are fed into the model. The learning rate is set to 3 × 10⁻⁴. The model is optimized using Adam [43] with β₁ = 0.9 and β₂ = 0.999.

Table 1: Quantitative evaluation of audio super-resolution models at different upsampling rates. Left/right results are SNR/LSD (higher is better for SNR while lower is better for LSD). Baseline results are those reported in [1].

The metrics used to evaluate the model are the signal-to-noise ratio (SNR) and the log-spectral distance (LSD) [44]. These are standard metrics in the signal processing literature. Given a reference signal y and the corresponding approximation x, the SNR is defined as

$$\mathrm{SNR}(x, y) = 10 \log \frac{\lVert y \rVert_2^2}{\lVert x - y \rVert_2^2} \tag{3}$$
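As an illustration, the SNR of Eq. (3) can be computed with the minimal NumPy sketch below. The function name `snr_db` is hypothetical, and the base-10 logarithm is an assumption: the equation does not state the base, but base 10 is the usual convention for values reported in dB.

```python
import numpy as np


def snr_db(x, y):
    """SNR of Eq. (3) between an approximation x and a reference y.

    Hypothetical helper. Assumes a base-10 logarithm (decibel convention)
    and that x differs from y, so the denominator is non-zero.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    signal_energy = np.sum(y ** 2)        # ||y||_2^2
    noise_energy = np.sum((x - y) ** 2)   # ||x - y||_2^2
    return 10.0 * np.log10(signal_energy / noise_energy)
```

For instance, a reference of four ones approximated with a constant offset of 0.1 gives an energy ratio of 4 / 0.04 = 100, that is, an SNR of about 20 dB. A larger reconstruction error lowers the value.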
Table 2: Training time evaluation. The number of seconds per epoch was obtained using an NVIDIA Tesla K80 GPU.

Model                   TFiLM     AFiLM
Number of parameters    6.82e7    1.34e8
Seconds per epoch       370       276

The LSD measures the reconstruction quality of individual frequencies and is defined as

$$\mathrm{LSD}(x, y) = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( X(t, k) - \hat{X}(t, k) \right)^2} \tag{4}$$

where X and X̂ are the log-spectral power magnitudes of y and x, respectively, defined as X = log |S|², where S is the short-time Fourier transform (STFT) of the signal. The indices t and k run over frames and frequencies, respectively.

We compare our best models with existing approaches at upscaling ratios 2, 4 and 8. The results are presented in Table 1. All the baseline results are those reported in [1]. Our contributions result in an average improvement of 0.2 dB over the TFiLM in terms of SNR. Concerning the LSD metric, our approach improves by 0.3 dB on average. This shows that our model effectively uses the self-attention mechanism to capture long-term information in the audio signal. Our attention-based model outperforms all previous models on the multi-speaker task. It is the most difficult task and is the one that benefits most from more long-term information.

Generalization of the super-resolution model. We investigate the model's ability to generalize to other domains. To do this, we switch from speech to music and vice versa. The results are presented in Table 3. As seen in previous models [1,7], the super-resolved samples contain the high-frequency details but still sound noisy. The models specialize to the specific type of audio they are trained on.

6. CONCLUSION

In this work, we introduce the use of self-attention for the audio super-resolution task. We present the Attention-based Feature-Wise Linear Modulation (AFiLM) layer, which relies on attention instead of recurrent neural networks to alter the activations of the convolutional model. The resulting model efficiently captures long-range temporal interactions. It outperforms all previous models and can be trained faster.

In future work, we want to develop super-resolution models that generalize well to different types of inputs. We also want to investigate perceptual-based models.

We would like to thank Bruce Basset for his helpful comments and advice.

8. REFERENCES
