Self-Attention For Audio Super-Resolution
Figure 2. Our implementation uses K = 4 and a Transformer block composed of a stack of 4 layers with h = 8 attention heads and hidden dimension d = 2048.
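For concreteness, the hyper-parameters above can be instantiated roughly as follows. This is a minimal sketch assuming PyTorch; the model width (512 channels), the interpretation of d = 2048 as the feed-forward width, and the idea that the stack runs over a sequence of pooled feature blocks are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch of the Transformer stack described above (4 layers, h = 8 heads).
# Treating d = 2048 as the position-wise feed-forward width and using a model
# width of 512 channels are assumptions, not the authors' settings.
def make_transformer_stack(d_model: int = 512,
                           num_layers: int = 4,
                           num_heads: int = 8,
                           d_hidden: int = 2048,
                           dropout: float = 0.1) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(
        d_model=d_model,            # channel dimension of the input sequence
        nhead=num_heads,            # h = 8 attention heads
        dim_feedforward=d_hidden,   # d = 2048 hidden units in the feed-forward sub-layer
        dropout=dropout,
        batch_first=True,           # inputs shaped (batch, sequence, channels)
    )
    return nn.TransformerEncoder(layer, num_layers=num_layers)  # stack of 4 layers

# Example: apply the stack to a hypothetical (batch, blocks, channels) sequence
# of pooled activations to give each block long-range context.
stack = make_transformer_stack()
pooled = torch.randn(2, 32, 512)
context = stack(pooled)             # same shape as the input
```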
4.2. Training details

Our model is trained for 50 epochs on patches of length 8192, as are existing audio super-resolution models; this ensures a fair comparison. The low-resolution audio signals are first processed with bicubic upscaling before they are fed into the model. The learning rate is set to 3 × 10⁻⁴. The model is optimized using Adam [43] with β1 = 0.9 and β2 = 0.999.

Table 2: Training time evaluation. The number of seconds per epoch was obtained using an NVIDIA Tesla K80 GPU.

    Model                   TFiLM     AFiLM
    Number of parameters    6.82e7    1.34e8
    Seconds per epoch       370       276
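The optimizer settings above translate directly into code. The following is a hedged sketch, again assuming PyTorch: `model` and `patch_loader` are placeholders, the cubic-spline interpolation stands in for the bicubic upscaling step, and the MSE reconstruction loss is an assumption since the loss is not restated in this section.

```python
import numpy as np
import torch
from scipy.interpolate import interp1d

PATCH_LEN = 8192   # training patch length
EPOCHS = 50

def cubic_upsample(x_lr: np.ndarray, target_len: int = PATCH_LEN) -> np.ndarray:
    """Cubic-spline interpolation of a 1-D low-resolution patch to the target
    length, standing in for the 'bicubic upscaling' preprocessing step."""
    t_lr = np.linspace(0.0, 1.0, num=len(x_lr))
    t_hr = np.linspace(0.0, 1.0, num=target_len)
    return interp1d(t_lr, x_lr, kind="cubic")(t_hr)

def train(model: torch.nn.Module, patch_loader, epochs: int = EPOCHS) -> None:
    """Hypothetical training loop with the reported optimizer settings.
    `patch_loader` is assumed to yield (upscaled_lr, hr) float tensors of
    shape (batch, 1, PATCH_LEN)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
    loss_fn = torch.nn.MSELoss()   # assumed reconstruction loss
    for _ in range(epochs):
        for upscaled_lr, hr in patch_loader:
            loss = loss_fn(model(upscaled_lr), hr)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```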
5. RESULTS

The metrics used to evaluate the model are the signal-to-noise ratio (SNR) and the log-spectral distance (LSD) [44]. These are standard metrics in the signal processing literature. Given a reference signal y and the corresponding approximation x, the SNR is defined as

\[
\mathrm{SNR}(x, y) = 10 \log \frac{\lVert y \rVert_2^2}{\lVert x - y \rVert_2^2} \tag{3}
\]
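Eq. (3) has a direct implementation, assuming the logarithm is taken in base 10 so the result is in decibels:

```python
import numpy as np

def snr(x: np.ndarray, y: np.ndarray) -> float:
    """Signal-to-noise ratio (dB) of approximation x against reference y, Eq. (3)."""
    return 10.0 * np.log10(np.sum(y ** 2) / np.sum((x - y) ** 2))
```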
The LSD measures the reconstruction quality of individual frequencies and is defined as

\[
\mathrm{LSD}(x, y) = \frac{1}{T} \sum_{t=1}^{T} \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( X(t, k) - \hat{X}(t, k) \right)^{2}} \tag{4}
\]

where X and X̂ are the log-spectral power magnitudes of y and x, defined as X = log |S|² with S the short-time Fourier transform (STFT) of the signal, and t and k index frames and frequencies, respectively.
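Eq. (4) can likewise be computed from the STFTs of the two signals. The sketch below uses SciPy; the frame length and hop size are illustrative choices (they are not restated in this section), and a small constant guards against taking the log of zero.

```python
import numpy as np
from scipy.signal import stft

def lsd(x: np.ndarray, y: np.ndarray, n_fft: int = 2048, hop: int = 512) -> float:
    """Log-spectral distance between approximation x and reference y, Eq. (4)."""
    eps = 1e-10  # guard against log(0)
    _, _, S_y = stft(y, nperseg=n_fft, noverlap=n_fft - hop)  # reference spectrogram
    _, _, S_x = stft(x, nperseg=n_fft, noverlap=n_fft - hop)  # approximation spectrogram
    X = np.log(np.abs(S_y) ** 2 + eps)        # log-spectral power magnitude of y
    X_hat = np.log(np.abs(S_x) ** 2 + eps)    # log-spectral power magnitude of x
    # Root-mean-square over frequencies k (axis 0), then mean over frames t.
    return float(np.mean(np.sqrt(np.mean((X - X_hat) ** 2, axis=0))))
```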
Table 1: Quantitative evaluation of audio super-resolution models at different upsampling rates. Left/right results are SNR/LSD (higher is better for SNR, lower is better for LSD). Baseline results are those reported in [1].

We compare our best models with existing approaches at upscaling ratios 2, 4 and 8. The results are presented in Table 1; all baseline results are those reported in [1]. Our contributions result in an average improvement of 0.2 dB over TFiLM in terms of SNR. Concerning the LSD metric, our approach improves by 0.3 dB on average. This shows that our model effectively uses the self-attention mechanism to capture long-term information in the audio signal. Our attention-based model outperforms all previous models on the multi-speaker task, which is the most difficult task and the one that benefits most from long-term information.

Generalization of the super-resolution model. We investigate the model's ability to generalize to other domains. To do this, we switch from speech to music and the other way around. The results are presented in Table 3. As seen with previous models [1, 7], the super-resolved samples contain the high-frequency details but still sound noisy. The models specialize to the specific type of audio they are trained on.

Table 3: Out-of-distribution evaluation of the model. The model is trained on the VCTK Multispeaker and Piano datasets at scale r = 2 and tested both on the same and on the other dataset. Left/right results are SNR/LSD.

                          VCTK Multi (Test)    Piano (Test)
    VCTK Multi (Train)    20.0 / 1.7           24.4 / 2.5
    Piano (Train)         10.2 / 2.6           25.7 / 1.5
6. CONCLUSION

In this work, we introduce the use of self-attention for the audio super-resolution task. We present the Attention-based Feature-Wise Linear Modulation (AFiLM) layer, which relies on attention instead of recurrent neural networks to alter the activations of the convolutional model. The resulting model efficiently captures long-range temporal interactions; it outperforms all previous models and can be trained faster.

In future work, we want to develop super-resolution models that generalize well to different types of inputs. We also want to investigate perceptually-based models.

7. ACKNOWLEDGEMENTS

We would like to thank Bruce Basset for his helpful comments and advice.

8. REFERENCES

[1] Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei W. Koh, and Stefano Ermon, "Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations," in Advances in Neural Information Processing Systems, 2019, pp. 10287–10298.

[2] Per Ekstrand, "Bandwidth extension of audio signals by spectral band replication," in Proceedings of the 1st IEEE Benelux Workshop on Model Based Processing and Coding of Audio (MPCA'02), 2002.
[16] Yan Ming Cheng, Douglas O'Shaughnessy, and Paul Mermelstein, "Statistical recovery of wideband speech from narrowband speech," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 544–548, 1994.

[17] Hannu Pulakka, Ulpu Remes, Kalle Palomäki, Mikko Kurimo, and Paavo Alku, "Speech bandwidth extension using Gaussian mixture model-based estimation of the highband mel spectrum," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011.

[26] Shichao Hu, Bin Zhang, Beici Liang, Ethan Zhao, and Simon Lui, "Phase-aware music super-resolution using generative adversarial networks," arXiv preprint arXiv:2010.04506, 2020.

[27] Sen Li, Stéphane Villette, Pravin Ramadas, and Daniel J. Sinder, "Speech bandwidth extension using generative adversarial networks," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5029–5033.
[28] Xinyu Li, Venkata Chebiyyam, Katrin Kirchhoff, and AI Amazon, "Speech audio super-resolution for speech recognition," in INTERSPEECH, 2019, pp. 3416–3420.

[29] Rithesh Kumar, Kundan Kumar, Vicki Anand, Yoshua Bengio, and Aaron Courville, "NU-GAN: High resolution neural upsampling with GAN," arXiv preprint arXiv:2010.11362, 2020.

[30] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative adversarial networks," arXiv preprint arXiv:1406.2661, 2014.

[31] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[32] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio, "Neural combinatorial optimization with reinforcement learning," arXiv preprint arXiv:1611.09940, 2016.

[33] Yuma Koizumi, Kohei Yatabe, Marc Delcroix, Yoshiki Masuyama, and Daiki Takeuchi, "Speech enhancement using self-adaptation and multi-head self-attention," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 181–185.

[34] Xiang Hao, Changhao Shan, Yong Xu, Sining Sun, and Lei Xie, "An attention-based neural network approach for single channel speech enhancement," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6895–6899.

[35] Ritwik Giri, Umut Isik, and Arvindh Krishnaswamy, "Attention Wave-U-Net for speech enhancement," in 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2019, pp. 249–253.

[36] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le, "Attention augmented convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3286–3295.

[37] Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, et al., "Conformer: Convolution-augmented transformer for speech recognition," arXiv preprint arXiv:2005.08100, 2020.

[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[40] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.

[41] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville, "FiLM: Visual reasoning with a general conditioning layer," arXiv preprint arXiv:1709.07871, 2017.

[42] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, "Layer normalization," arXiv preprint arXiv:1607.06450, 2016.

[43] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[44] Augustine Gray and John Markel, "Distance measures for speech processing," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.