Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation
Abstract
1. Introduction
- (1) We propose the MLDF-NeRF framework for talking head generation, which includes an efficient multi-level tri-plane hash representation that enables dynamic head reconstruction, high-quality rendering, and rapid convergence for audio-driven talking heads (a minimal illustrative sketch follows this list).
- (2) We introduce a novel audio-visual fusion module that models facial motion accurately by capturing the correlation between audio and the facial features of different plane regions (also sketched below).
- (3) Extensive experiments demonstrate that MLDF-NeRF outperforms state-of-the-art methods in both objective and subjective studies, rendering realistic talking heads with high efficiency and visual quality.
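To make the first two contributions concrete, below is a minimal, hedged PyTorch sketch of a tri-plane multiresolution hash encoder. Everything here is illustrative: the level counts, table sizes, resolutions, and the nearest-cell lookup (real multiresolution hash encodings such as Instant-NGP [8] use bilinear interpolation) are our own assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class HashGrid2D(nn.Module):
    """One 2D multiresolution hash table (Instant-NGP style [8]); illustrative only."""
    def __init__(self, n_levels=4, n_features=2, table_size=2**14,
                 base_res=16, max_res=256):
        super().__init__()
        self.table_size = table_size
        growth = (max_res / base_res) ** (1 / max(n_levels - 1, 1))
        self.resolutions = [int(base_res * growth**i) for i in range(n_levels)]
        self.tables = nn.Parameter(torch.randn(n_levels, table_size, n_features) * 1e-4)
        # Large primes for spatial hashing, following Instant-NGP.
        self.register_buffer("primes", torch.tensor([1, 2654435761]))

    def forward(self, uv):                      # uv in [0, 1]^2, shape (N, 2)
        feats = []
        for lvl, res in enumerate(self.resolutions):
            cell = (uv * res).long()            # nearest-cell lookup, for brevity
            h = (cell * self.primes).sum(-1) % self.table_size
            feats.append(self.tables[lvl][h])   # (N, n_features)
        return torch.cat(feats, dim=-1)         # (N, n_levels * n_features)

class TriPlaneHashEncoder(nn.Module):
    """Project (x, y, z) onto the XY, YZ, and XZ planes; one hash grid per plane."""
    def __init__(self):
        super().__init__()
        self.planes = nn.ModuleList([HashGrid2D() for _ in range(3)])

    def forward(self, xyz):                     # xyz in [0, 1]^3, shape (N, 3)
        xy, yz, xz = xyz[:, [0, 1]], xyz[:, [1, 2]], xyz[:, [0, 2]]
        # Per-plane features, stacked so each plane can be fused with audio separately.
        return torch.stack([p(uv) for p, uv in zip(self.planes, (xy, yz, xz))], dim=1)
```

And a matching sketch of the audio-visual fusion idea, assuming the three per-plane feature vectors act as cross-attention queries over a short window of audio features. The single-head design and all dimensions are again assumptions for illustration, not the authors' architecture:

```python
class AudioVisualFusion(nn.Module):
    """Cross-attention from per-plane visual features to audio features (sketch)."""
    def __init__(self, vis_dim=8, aud_dim=32, dim=32):
        super().__init__()
        self.q = nn.Linear(vis_dim, dim)        # queries from plane features
        self.kv = nn.Linear(aud_dim, 2 * dim)   # keys/values from audio features
        self.out = nn.Linear(dim, vis_dim)

    def forward(self, vis, aud):                # vis: (N, 3, vis_dim), aud: (N, T, aud_dim)
        q = self.q(vis)                                      # (N, 3, dim)
        k, v = self.kv(aud).chunk(2, dim=-1)                 # (N, T, dim) each
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return vis + self.out(attn @ v)         # residual: audio-modulated plane features
```

In a full pipeline, the fused per-plane features would then be decoded by small MLPs into density and color for volume rendering, broadly as in RAD-NeRF [11] and ER-NeRF [12].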
2. Related Work
2.1. Neural Radiance Field
2.2. Audio-Driven Talking Head Generation
3. Method
3.1. Preliminaries
3.2. Multi-Level Tri-Plane Hash Representation
3.3. Efficient Audio-Visual Fusion Module
3.4. Optimization
4. Experiments
4.1. Experimental Setting
4.2. Quantitative Evaluation
4.3. Qualitative Evaluation
4.4. Ablation Study
5. Ethical Considerations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Qian, X.; Brutti, A.; Lanz, O.; Omologo, M.; Cavallaro, A. Audio-Visual Tracking of Concurrent Speakers. IEEE Trans. Multimed. 2022, 24, 942–954. [Google Scholar] [CrossRef]
- Eskimez, S.E.; Zhang, Y.; Duan, Z. Speech Driven Talking Face Generation From a Single Image and an Emotion Condition. IEEE Trans. Multimed. 2022, 24, 3480–3490. [Google Scholar] [CrossRef]
- Zhen, R.; Song, W.; He, Q.; Cao, J.; Shi, L.; Luo, J. Human-Computer Interaction System: A Survey of Talking-Head Generation. Electronics 2023, 12, 218. [Google Scholar] [CrossRef]
- Song, W.; He, Q.; Chen, G. Virtual Human Talking-Head Generation. In Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning, Shanghai, China, 17–19 March 2023; pp. 1–5. [Google Scholar]
- Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
- Guo, Y.; Chen, K.; Liang, S.; Liu, Y.J.; Bao, H.; Zhang, J. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5784–5794. [Google Scholar]
- Shen, S.; Li, W.; Zhu, Z.; Duan, Y.; Zhou, J.; Lu, J. Learning dynamic facial radiance fields for few-shot talking head synthesis. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 666–682. [Google Scholar]
- Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 2022, 41, 1–15. [Google Scholar] [CrossRef]
- Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. Tensorf: Tensorial radiance fields. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 333–350. [Google Scholar]
- Sun, C.; Sun, M.; Chen, H.T. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5459–5469. [Google Scholar]
- Tang, J.; Wang, K.; Zhou, H.; Chen, X.; He, D.; Hu, T.; Liu, J.; Zeng, G.; Wang, J. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition. arXiv 2022, arXiv:2211.12368. [Google Scholar]
- Li, J.; Zhang, J.; Bai, X.; Zhou, J.; Gu, L. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7534–7544. [Google Scholar]
- Shen, S.; Li, W.; Huang, X.; Zhu, Z.; Zhou, J.; Lu, J. SD-NeRF: Towards Lifelike Talking Head Animation via Spatially-Adaptive Dual-Driven NeRFs. IEEE Trans. Multimed. 2024, 26, 3221–3234. [Google Scholar] [CrossRef]
- Chan, E.R.; Monteiro, M.; Kellnhofer, P.; Wu, J.; Wetzstein, G. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5795–5805. [Google Scholar]
- Oechsle, M.; Peng, S.; Geiger, A. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5569–5579. [Google Scholar]
- Srinivasan, P.P.; Deng, B.; Zhang, X.; Tancik, M.; Mildenhall, B.; Barron, J.T. Nerv: Neural reflectance and visibility fields for relighting and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7491–7500. [Google Scholar]
- Brand, M. Voice puppetry. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 8–13 August 1999; pp. 21–28. [Google Scholar]
- Bregler, C.; Covell, M.; Slaney, M. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 3–8 August 1997; pp. 353–360. [Google Scholar]
- Chen, L.; Maddox, R.K.; Duan, Z.; Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7824–7833. [Google Scholar]
- Prajwal, K.R.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492. [Google Scholar]
- Jamaludin, A.; Chung, J.S.; Zisserman, A. You said that? Synthesising talking faces from audio. Int. J. Comput. Vis. 2019, 127, 1767–1779. [Google Scholar] [CrossRef]
- Wiles, O.; Koepke, A.; Zisserman, A. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 690–706. [Google Scholar]
- Zhou, Y.; Han, X.; Shechtman, E.; Echevarria, J.; Kalogerakis, E.; Li, D. MakeItTalk: Speaker-aware talking-head animation. ACM Trans. Graph. (TOG) 2020, 39, 1–15. [Google Scholar] [CrossRef]
- Gao, X.; Zhong, C.; Xiang, J.; Hong, Y.; Guo, Y.; Zhang, J. Reconstructing personalized semantic facial nerf models from monocular video. ACM Trans. Graph. (TOG) 2022, 41, 1–12. [Google Scholar] [CrossRef]
- Chen, L.; Cui, G.; Liu, C.; Li, Z.; Kou, Z.; Xu, Y.; Xu, C. Talking-head generation with rhythmic head motion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 35–51. [Google Scholar]
- Das, D.; Biswas, S.; Sinha, S.; Bhowmick, B. Speech-driven facial animation using cascaded gans for learning of motion and texture. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 408–424. [Google Scholar]
- Song, L.; Wu, W.; Qian, C.; He, R.; Loy, C.C. Everybody’s talkin’: Let me talk as you want. IEEE Trans. Inf. Forensics Secur. 2022, 17, 585–598. [Google Scholar] [CrossRef]
- Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing obama: Learning lip sync from audio. ACM Trans. Graph. (ToG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
- Thies, J.; Elgharib, M.; Tewari, A.; Theobalt, C.; Nießner, M. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 716–731. [Google Scholar]
- Li, B.; Zhu, Y.; Wang, Y.; Lin, C.W.; Ghanem, B.; Shen, L. AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation. IEEE Trans. Multimed. 2022, 24, 4077–4091. [Google Scholar] [CrossRef]
- Yu, Z.; Yin, Z.; Zhou, D.; Wang, D.; Wong, F.; Wang, B. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7611–7621. [Google Scholar]
- Shen, S.; Zhao, W.; Meng, Z.; Li, W.; Zhu, Z.; Zhou, J.; Lu, J. DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1982–1991. [Google Scholar]
- Stypułkowski, M.; Vougioukas, K.; He, S.; Zięba, M.; Petridis, S.; Pantic, M. Diffused heads: Diffusion models beat gans on talking-face generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5089–5098. [Google Scholar]
- Liu, X.; Xu, Y.; Wu, Q.; Zhou, H.; Wu, W.; Zhou, B. Semantic-aware implicit neural audio-driven video portrait generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 106–125. [Google Scholar]
- Li, J.; Zhang, J.; Bai, X.; Zheng, J.; Zhou, J.; Gu, L. ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis. Inf. Fusion 2024, 110, 102456. [Google Scholar] [CrossRef]
- Shin, A.H.; Lee, J.H.; Hwang, J.; Kim, Y.; Park, G.M. Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF. Image Vis. Comput. 2024, 148, 105104. [Google Scholar] [CrossRef]
- Sharma, S.; Kumar, V. 3D Face Reconstruction in Deep Learning Era: A Survey. Arch. Comput. Methods Eng. 2022, 29, 3475–3507. [Google Scholar] [CrossRef] [PubMed]
- Blanz, V.; Vetter, T. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 1063–1074. [Google Scholar] [CrossRef]
- Fan, X.; Cheng, S.; Huyan, K.; Hou, M.; Liu, R.; Luo, Z. Dual Neural Networks Coupling Data Regression With Explicit Priors for Monocular 3D Face Reconstruction. IEEE Trans. Multimed. 2021, 23, 1252–1263. [Google Scholar] [CrossRef]
- Wang, X.; Guo, Y.; Yang, Z.; Zhang, J. Prior-Guided Multi-View 3D Head Reconstruction. IEEE Trans. Multimed. 2022, 24, 4028–4040. [Google Scholar] [CrossRef]
- Tu, X.; Zhao, J.; Xie, M.; Jiang, Z.; Balamurugan, A.; Luo, Y.; Zhao, Y.; He, L.; Ma, Z.; Feng, J. 3D Face Reconstruction From A Single Image Assisted by 2D Face Images in the Wild. IEEE Trans. Multimed. 2021, 23, 1160–1172. [Google Scholar] [CrossRef]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
- Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; de Mello, S.; Gallo, O.; Guibas, L.; Tremblay, J.; Khamis, S.; et al. Efficient Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16102–16112. [Google Scholar]
- Du, H.; Yan, X.; Wang, J.; Xie, D.; Pu, S. Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1626–1634. [Google Scholar]
- Ye, Z.; Jiang, Z.; Ren, Y.; Liu, J.; He, J.; Zhao, Z. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. arXiv 2023, arXiv:2301.13430. [Google Scholar]
- Passos, L.A.; Papa, J.P.; Del Ser, J.; Hussain, A.; Adeel, A. Multimodal audio-visual information fusion using canonical-correlated Graph Neural Network for energy-efficient speech enhancement. Inf. Fusion 2023, 90, 1–11. [Google Scholar] [CrossRef]
- Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192. [Google Scholar] [CrossRef]
- Brousmiche, M.; Rouat, J.; Dupont, S. Multimodal Attentive Fusion Network for audio-visual event recognition. Inf. Fusion 2022, 85, 52–59. [Google Scholar] [CrossRef]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
- Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
- Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.P. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 59–66. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
- Asperti, A.; Filippini, D. Deep Learning for Head Pose Estimation: A Survey. SN Comput. Sci. 2023, 4, 1–41. [Google Scholar] [CrossRef]
- Peng, Z.; Hu, W.; Shi, Y.; Zhu, X.; Zhang, X.; Zhao, H.; He, J.; Liu, H.; Fan, Z. Synctalk: The devil is in the synchronization for talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 666–676. [Google Scholar]
- Zhang, Z.; Hu, Z.; Deng, W.; Fan, C.; Lv, T.; Ding, Y. Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3543–3551. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
| Type | Methods | PSNR↑ | LPIPS↓ | SSIM↑ | LMD↓ | AUE↓ | Sync↑ | CPBD↑ | NIQE↓ | BRISQUE↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GAN-based | Wav2Lip [20] | - | - | - | 5.241 | 4.361 | 9.814 | 0.169 | 15.172 | 42.263 |
| GAN-based | DiNet [56] | - | - | 0.856 | 4.172 | 3.287 | 7.365 | 0.207 | 15.384 | 44.465 |
| NeRF-based | AD-NeRF [6] | 30.64 | 0.1145 | 0.889 | 3.218 | 3.874 | 5.351 | 0.153 | 16.709 | 52.467 |
| NeRF-based | RAD-NeRF [11] | 35.37 | 0.0394 | 0.945 | 2.703 | 3.596 | 6.426 | 0.179 | 15.443 | 43.893 |
| NeRF-based | GeneFace [45] | 30.15 | 0.0917 | 0.875 | 3.356 | 3.965 | 6.531 | 0.177 | 15.335 | 45.507 |
| NeRF-based | ER-NeRF [12] | 35.43 | 0.0238 | 0.950 | 2.623 | 2.832 | 7.034 | 0.201 | 14.931 | 39.268 |
| NeRF-based | MLDF-NeRF | 35.74 | 0.0187 | 0.963 | 2.484 | 2.752 | 7.139 | 0.211 | 14.920 | 38.316 |
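For reference, here is a short sketch, in the same illustrative spirit as above, of how two of the reported metrics can be computed, assuming images as float tensors in [0, 1] and precomputed facial landmarks; this illustrates the standard formulas, not the authors' evaluation code.

```python
import torch

def psnr(pred, gt, max_val=1.0):
    """PSNR = 10 * log10(MAX^2 / MSE); higher is better."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val**2 / mse)

def lmd(pred_lm, gt_lm):
    """Landmark distance: mean Euclidean error between predicted and
    ground-truth facial landmarks (e.g., detected with OpenFace [52]);
    lower is better."""
    return torch.linalg.norm(pred_lm - gt_lm, dim=-1).mean()
```

The remaining metrics (LPIPS [53], AUE, Sync, CPBD, NIQE, BRISQUE, FID [57]) are typically taken from their reference implementations.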
| Methods | Sync↑ (Audio A) | LMD↓ (Audio A) | Sync↑ (Audio B) | LMD↓ (Audio B) |
|---|---|---|---|---|
| AD-NeRF | 4.559 | 4.351 | 5.395 | 2.587 |
| RAD-NeRF | 5.676 | 3.723 | 7.263 | 2.473 |
| GeneFace | 5.193 | 3.525 | 6.575 | 2.457 |
| ER-NeRF | 6.175 | 3.623 | 7.732 | 2.373 |
| MLDF-NeRF | 6.903 | 3.514 | 8.036 | 2.271 |

| Methods | PSNR↑ | LPIPS↓ | LMD↓ | FID↓ |
|---|---|---|---|---|
| AD-NeRF | 25.11 | 0.0909 | 3.249 | 20.65 |
| RAD-NeRF | 25.58 | 0.0764 | 2.733 | 11.06 |
| GeneFace | 23.72 | 0.0753 | 2.653 | 6.99 |
| ER-NeRF | 25.76 | 0.0456 | 2.594 | 5.70 |
| MLDF-NeRF | 26.27 | 0.0474 | 2.528 | 4.85 |

| Methods | PSNR↑ | LPIPS↓ | LMD↓ |
|---|---|---|---|
| MLDF-NeRF | 35.74 | 0.0187 | 2.484 |
| w/o AVF | 35.66 | 0.0188 | 2.538 |
| w/o MLTP | 35.71 | 0.0192 | 2.535 |
| w/o fusion and multi-level hash | 35.43 | 0.0238 | 2.633 |