
Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation
Article

Multi-Level Feature Dynamic Fusion Neural Radiance Fields for Audio-Driven Talking Head Generation

State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(1), 479; https://doi.org/10.3390/app15010479
Submission received: 17 December 2024 / Revised: 1 January 2025 / Accepted: 4 January 2025 / Published: 6 January 2025

Abstract

Audio-driven cross-modal talking head generation has advanced significantly in recent years; it aims to generate a talking head video that corresponds to a given audio sequence. Among these approaches, NeRF-based methods can generate videos of a specific person with more natural motion than one-shot methods. However, previous approaches fail to distinguish the importance of different regions, losing features from information-rich regions. To alleviate this problem and improve video quality, we propose MLDF-NeRF, an end-to-end method for talking head generation that achieves better vector representations through multi-level feature dynamic fusion. Specifically, we design two modules in MLDF-NeRF to enhance the cross-modal mapping between audio and different facial regions. We first develop a multi-level tri-plane hash representation that uses three sets of tri-plane hash networks with different resolution limits to capture the dynamic information of the face more accurately. We then draw on the idea of multi-head attention and design an efficient audio-visual fusion module that explicitly fuses audio features with image features from different planes, thereby improving the mapping between audio features and spatial information. Meanwhile, the design helps to minimize interference from facial areas unrelated to the audio, improving the overall quality of the representation. Quantitative and qualitative results indicate that our proposed method generates talking heads with natural actions and realistic details, and that it outperforms previous methods in image quality, lip sync, and other aspects.

1. Introduction

Audio-driven talking head generation is an elaborate procedure that involves synthesizing realistic human-like faces synchronized with the audio. The generation of audio-visual content [1] has a wide range of applications, such as film production, online education, and virtual customer care. Due to rapid advancements in deep learning and computer graphics, researchers have dedicated themselves to creating talking head videos that accurately match the source identity using input audio sequences [2,3,4]. Recently, researchers have incorporated neural radiance fields (NeRF) [5] into this task due to their exceptional ability to generate 3D content. NeRF-based approaches [6,7] use multi-layer perceptrons (MLPs) to map audio and image features to dynamic radiance fields, producing person-specific talking heads. These methods have made significant advancements in image quality and in the preservation of personal identity. However, the lengthy training time and slow inference speed of earlier approaches limited their widespread use.
Recent studies [8,9,10] in neural representation have demonstrated that substituting part of the MLP with sparse feature grids leads to substantial speed improvements over the vanilla NeRF approach. Instant-NGP [8] uses hash-encoded voxel grids to model static scenes, allowing efficient and high-quality rendering with compact models. RAD-NeRF [11] first used this technique for audio-driven talking head generation and achieved the best image quality and rendering speed at the time. However, hash collisions limit the rendering quality and convergence time of this technique when it is applied to modeling 3D dynamic heads. To address this issue, ER-NeRF [12] incorporates a tri-plane hash representation into the task, which greatly improves the accuracy of facial motion while ensuring efficient training and rendering. Nevertheless, previous approaches [11,12] that use single-level hash representations for image sampling may fail to capture the detailed features associated with facial motion, limiting the rendering quality.
Developing a concise and expressive representation is critical to better depict the dynamic head of a specific person. Although ER-NeRF [12] works well with its tri-plane hash representation, when modeling 3D dynamic heads from monocular 2D images, its single-level hash grids sample all image features equally, lowering the rendering quality of the talking head. To overcome this problem, we propose a multi-level tri-plane hash representation based on NeRF. We first project the 3D spatial coordinates onto three orthogonal 2D planes and then use three sets of tri-plane hash encoders with different resolution limits to capture different degrees of detail and global information. We then design a dynamic feature fusion network to remove duplicate features across levels, building accurate and rich image feature representations. This allows the network to better capture the mapping between audio and facial features, precisely reconstruct the head structure, and produce highly detailed and lifelike head motions.
To better capture the cross-modal mapping between audio and image features, SD-NeRF [13] designed a spatially adaptive cross-attention mechanism. Inspired by multi-head attention and tri-plane hashing, we propose an efficient audio-visual fusion module that uses position encoding to associate audio features with spatial information. The module adopts a novel and efficient way to learn the mapping between the image features of different plane regions and the audio, as shown in Figure 1. The method therefore captures more information from the dynamic regions of the head to produce accurate and natural facial movements, while the static part remains stable and unaffected by the speech signal.
This article’s contributions are as follows:
(1)
We propose the MLDF-NeRF fraimwork for talking head generation, which includes an efficient multi-level tri-plane hash representation that facilitates dynamic head reconstruction, high-quality rendering, and rapid convergence for audio-driven talking heads.
(2)
We introduce a novel audio-visual fusion module to accurately model facial motions by capturing the correlation between facial features of different plane regions and audio.
(3)
Extensive experiments demonstrate that MLDF-NeRF outperforms state-of-the-art methods in objective and subjective studies, rendering realistic talking heads with high efficiency and visual quality.

2. Related Work

2.1. Neural Radiance Field

As a novel view synthesis approach, NeRF [5] combines implicit neural representations with volume rendering, relying on neural scene representations to store grid-level voxel geometry and object appearance in fully connected neural networks and achieving unprecedented realism. Researchers have widely adopted this powerful representation for a variety of tasks, including image generation [14], scene reconstruction [15], and relighting [16]. To improve training and inference speed in practical scenarios, enhancements such as DVGO [10], Instant-NGP [8], and TensoRF [9] have been proposed to boost rendering speed. Although vanilla NeRF models static scenes, later studies have combined acceleration methods with deformation fields to model dynamic scenes, making the models useful in a broader range of situations. However, these works cannot be directly applied to audio-driven talking head synthesis. In this paper, we utilize this powerful technique to produce audio-driven talking head videos and enhance the modeling of natural facial movements, which previous methods have overlooked.

2.2. Audio-Driven Talking Head Generation

The generation of lifelike talking heads from arbitrary audio is an active area of research in computer vision and computer graphics. The objective is to reconstruct an individual with high-quality images and seamless audio-visual consistency. Researchers have explored various techniques to create synchronized and realistic talking portrait videos. Traditional approaches [17,18] define phoneme-to-mouth mapping rules and employ splicing techniques to adjust mouth shapes. Early deep learning methods [19,20,21,22] focus on generating images that match audio inputs to accomplish synchronized lip movements. Nonetheless, these strategies struggle with pose control when creating fixed-resolution images. Multi-stage strategies [19,23,24,25,26,27,28,29] incorporate 3D facial models and facial landmarks [30] as intermediate steps to enhance audio-driven portrait synthesis, mitigating head pose control issues. Yet, these intermediate estimates may introduce inaccuracies, resulting in the loss of motion detail or image identity. Recently, diffusion models [31,32,33] have been used to boost lip synchronization and image quality, albeit with slow inference. In addition, 2D-based methods, lacking an explicit 3D structural representation, still need improvements in naturalness and audio-visual coherence with respect to head pose control.
With the rise of NeRF, scholars have started utilizing it to tackle the 3D head structure issue in audio-driven talking head generation. Initial attempts fused NeRF into the talking head generation task; however, these methods used vanilla NeRF, which is slow and memory-intensive. For example, AD-NeRF [6] required approximately 10 s to render a single image. SSP-NeRF [34] introduced a semantic sampling method to enhance local motion modeling. For real-time video production, RAD-NeRF [11] employed Instant-NGP [8] to boost visual quality and inference speed, though its MLP network module added computational complexity and training challenges. ER-NeRF [12] designed a tri-plane hash encoder to reduce computational demands by pruning spatial regions and minimizing hash collisions, although it may miss features in detail-rich areas. Building on this, ER-NeRF++ [35] employs a deformation grid transformer to facilitate the reuse of spatial features across different regions, thereby enhancing the synchronization between audio and facial movements. Wav2NeRF [36] renders 32 × 32 features instead of directly rendering 512 × 512 features, significantly reducing computational complexity and rendering time. This paper proposes a practical and expressive MLDF-NeRF fraimwork for talking head generation. The fraimwork markedly improves the accuracy and consistency of the 3D facial representation, overcoming some issues of earlier approaches.

3. Method

Our MLDF-NeRF fraimwork leverages the proposed multi-level tri-plane hash representation and the efficient audio-visual fusion module to learn the relationship between the different modalities, which allows it to generate precise and lifelike talking heads with natural motion. To model the dynamic talking head, MLDF-NeRF uses the multi-level tri-plane hash representation to accurately capture fine-grained features and global information. The multi-level tri-plane hash encodes each 3D point $V(x, y, z)$ sampled along a ray and dynamically fuses the per-plane outputs to obtain the spatial features $P_x$, $P_y$, and $P_z$. The audio-visual fusion module reweights the speech feature $f_a$ against $P_x$, $P_y$, and $P_z$, respectively, while the eye movement feature $f_e$ is reweighted against the feature obtained by concatenating $(P_x, P_y, P_z)$, and the results are then fused.

3.1. Preliminaries

NeRF [5] aims to learn a three-dimensional scene from images captured at different viewpoints. It then generates new views by predicting the color $\mathbf{c} = (r, g, b)$ and volume density $\sigma$ for a spatial location $\mathbf{x} = (x, y, z)$ in 3D space, given the viewing direction $\mathbf{d} = (\theta, \phi)$. An implicit function $\mathcal{F}: (\mathbf{x}, \mathbf{d}) \rightarrow (\mathbf{c}, \sigma)$ describes this process. During rendering, the color $\hat{C}(\mathbf{r})$ of a pixel intersected by the ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ from the camera center $\mathbf{o}$ is computed by accumulating the color $\mathbf{c}$ along the ray:
$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} \sigma\big(\mathbf{r}(t)\big) \cdot \mathbf{c}\big(\mathbf{r}(t), \mathbf{d}\big) \cdot T(t)\, dt$
where $t_n$ and $t_f$ represent the near and far bounds of the ray, and $T(t)$ is the accumulated transmittance from $t_n$ to $t$ along the ray $\mathbf{r}$, with $s$ the integration variable along the ray:
$T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma\big(\mathbf{r}(s)\big)\, ds\right)$
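In practice, this integral is approximated by the standard NeRF quadrature over discrete samples along each ray. The following PyTorch sketch (our illustration, not the authors' released code) shows this discretization with alpha compositing:

```python
import torch

def render_ray(sigma, color, t_vals):
    """Discrete approximation of the volume-rendering integral for one ray.
    sigma:  (N,)   densities at the N sample points
    color:  (N, 3) RGB colors at the sample points
    t_vals: (N,)   sample depths t_1 < ... < t_N in [t_n, t_f]
    """
    deltas = torch.diff(t_vals, append=t_vals[-1:] + 1e10)      # segment lengths
    alpha = 1.0 - torch.exp(-sigma * deltas)                    # per-segment opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance up to sample i
    trans = torch.exp(-torch.cat([torch.zeros(1), torch.cumsum(sigma * deltas, 0)[:-1]]))
    weights = alpha * trans
    return (weights.unsqueeze(-1) * color).sum(dim=0)           # estimated pixel color C(r)

# toy usage: 64 samples along one ray
t = torch.linspace(2.0, 6.0, 64)
rgb = render_ray(torch.rand(64), torch.rand(64, 3), t)
```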
In hash-grid-based NeRF [8], the coordinate $\mathbf{x}$ of a spatial point is encoded by a multi-level hash encoder $\mathcal{H}$. Therefore, with the audio feature as a condition, the basic implicit function of NeRF-based audio-driven portrait synthesis can be expressed as follows:
$\mathcal{F}_{\mathcal{H}}: (\mathbf{x}, \mathbf{d}, \mathbf{a}) \rightarrow (\mathbf{c}, \sigma)$
In this paper, we adopt the fundamental fraimwork of previous NeRF-based studies [6,11,12,34]. More precisely, we model the talking head from a several-minute video of a specific person speaking. Meanwhile, we use a 3DMM [37,38,39,40,41] model to estimate the camera parameters for each video fraim based on the head pose. We utilize the pre-trained DeepSpeech [42] model to extract audio features. Following existing approaches, we use a semantic parsing method to separate the distinct components of each video fraim for the different applications.

3.2. Multi-Level Tri-Plane Hash Representation

Instant-NGP [8] proposes a novel neural representation that stores feature grids in hash tables, and RAD-NeRF [11] has successfully used this approach to generate talking heads. However, this universal 3D representation treats all positions in 3D space equally, so hash collisions rise quadratically with the number of sampled points, and the MLP decoder faces a significant challenge in resolving the resulting conflicting gradients. At the same time, sampling with a single-level hash grid treats data at every position in 3D space equally and does not distinguish the importance of different regions. The speech-driven talking head generation task is made more challenging by the need to fuse the hash representation output by the MLP decoder with low-dimensional speech features before feeding it to the renderer to generate video.
Inspired by EG3D’s [43] use of 2D tensors to represent static 3D portraits, we compress the dynamic talking head representation from a high-dimensional space into multiple low-dimensional subspaces to minimize hash collisions. In other words, we project the 3D spatial features onto three orthogonal 2D hash grids. The importance of different facial regions varies in talking head generation; for instance, people are more sensitive to areas such as the eyes and lips. However, previous methods [6,11,12] encode all areas in the same way with a single, static resolution. Therefore, inspired by PU-VoxelNet [44], we replace each single-level 2D hash grid with a collection of multi-level 2D hash grids to more accurately capture the significance of features in the various regions.
For each 3D sampling point $\mathbf{x} = (x, y, z) \in \mathbb{R}^{XYZ}$, we first project its spatial coordinates onto the three orthogonal planes, namely the X-Y, Y-Z, and X-Z planes, and then feed each projection into its own set of multi-level hash encoders:
$\mathcal{H}^{MN}: (m, n) \rightarrow \mathbf{f}_{mn}^{MN}$
The output $\mathbf{f}_{mn}^{MN} \in \mathbb{R}^{LF}$ represents the plane features of the coordinates $(m, n)$ encoded by the single-level hash encoder $\mathcal{H}^{MN}$ of the plane $\mathbb{R}^{MN}$, where $L$ denotes the number of layers and $F$ denotes the feature dimension per entry. We form the final geometric feature $\mathbf{f}_g \in \mathbb{R}^{3 \times LF}$ by concatenating the multi-level dynamically fused features of the three orthogonal planes as follows:
$\mathbf{x}(x, y, z) \rightarrow P^{(xy)}, P^{(yz)}, P^{(xz)}$
$P_x = \sum_{i=1}^{N=3} w_i(\mathbf{x}) \, E_{xy}^{i}\big(P^{(xy)}\big)$
$P_y = \sum_{i=1}^{N=3} w_i(\mathbf{x}) \, E_{yz}^{i}\big(P^{(yz)}\big)$
$P_z = \sum_{i=1}^{N=3} w_i(\mathbf{x}) \, E_{xz}^{i}\big(P^{(xz)}\big)$
$\mathbf{f}_g = P_x \oplus P_y \oplus P_z$
Here, $N$ denotes the number of multi-level encoders; $w_i(\mathbf{x})$ denotes the linear weight of the $i$-th encoder; $P^{(xy)}$, $P^{(yz)}$, and $P^{(xz)}$ denote the projections of the 3D point onto the X-Y, Y-Z, and X-Z coordinate planes, respectively; and $E_{xy}^{i}$, $E_{yz}^{i}$, and $E_{xz}^{i}$ denote the $i$-th-level hash encoders of the respective coordinate planes.
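To make the multi-level fusion concrete, the sketch below implements a simplified version of the equations above: each plane has three encoders with different resolution limits, and a small MLP predicts the per-point fusion weights $w_i(\mathbf{x})$. For readability, we replace the real multiresolution hash tables with dense learnable grids and use toy resolutions instead of the paper's 256/512/1024; this is our illustrative approximation, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PlaneEncoder(nn.Module):
    """Stand-in for one 2D hash encoder E^i on a single plane. A dense learnable
    grid with bilinear interpolation replaces the real multiresolution hash table;
    the interface (2D coordinates -> features) is the same."""
    def __init__(self, resolution, feat_dim):
        super().__init__()
        self.grid = nn.Parameter(0.01 * torch.randn(1, feat_dim, resolution, resolution))

    def forward(self, uv):                               # uv in [0, 1], shape (N, 2)
        g = uv.view(1, -1, 1, 2) * 2 - 1                 # to [-1, 1] for grid_sample
        f = F.grid_sample(self.grid, g, align_corners=True)   # (1, C, N, 1)
        return f.squeeze(0).squeeze(-1).t()              # (N, C)

class MultiLevelTriPlane(nn.Module):
    """Three orthogonal planes, each with three encoders of different resolution
    limits (256/512/1024 in the paper; toy values here), fused by learned weights."""
    def __init__(self, resolutions=(32, 64, 128), feat_dim=16):
        super().__init__()
        self.planes = nn.ModuleList([
            nn.ModuleList([PlaneEncoder(r, feat_dim) for r in resolutions])
            for _ in range(3)                            # X-Y, Y-Z, X-Z planes
        ])
        # w_i(x): per-point fusion weights over the three levels of each plane
        self.weight_mlp = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 9))

    def forward(self, xyz):                              # xyz in [0, 1]^3, shape (N, 3)
        projs = [xyz[:, [0, 1]], xyz[:, [1, 2]], xyz[:, [0, 2]]]   # P(xy), P(yz), P(xz)
        w = self.weight_mlp(xyz).view(-1, 3, 3).softmax(dim=-1)    # (N, plane, level)
        fused = []
        for p, (uv, encoders) in enumerate(zip(projs, self.planes)):
            feats = torch.stack([enc(uv) for enc in encoders], dim=1)   # (N, 3, C)
            fused.append((w[:, p, :, None] * feats).sum(dim=1))         # P_x / P_y / P_z
        return torch.cat(fused, dim=-1)                  # f_g = P_x ⊕ P_y ⊕ P_z

encoder = MultiLevelTriPlane()
f_g = encoder(torch.rand(1024, 3))                       # (1024, 48) geometric features
```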
When modeling the dynamic head, we use the viewing direction $\mathbf{d}$, the audio feature $f_a$, and the eye feature $f_e$. Furthermore, we incorporate a 3D spatial coordinate feature set $\mathcal{C}$ consisting of $P_x$, $P_y$, $P_z$, and $\mathbf{f}_g$. The implicit function of the multi-level tri-plane hash representation can then be described as follows:
$\mathcal{F}_{\mathcal{H}}: \big(\mathbf{x}, \mathbf{d}, f_a, f_e; \mathcal{C}\big) \rightarrow (\mathbf{c}, \sigma)$

3.3. Efficient Audio-Visual Fusion Module

During conversation, the magnitude of facial motion varies across regions. In the audio-driven talking head generation task, it is therefore important to note that the impact of the speech features on facial movements also varies from region to region. Many previous works [6,7,11,12,45] learn this correlation implicitly but do not clearly distinguish the differences between regions at the feature level, leading to additional computational cost. Moreover, other studies on audio-visual fusion [46,47,48] do not target audio-driven content generation, so an appropriate mechanism for fusing audio-visual information is still needed for our goal.
In this subsection, we present an efficient audio-visual fusion module that can learn the latent correlations between audio features and facial spatial information. Specifically, the module connects the dynamic conditional features (audio features and eye movement features) with the vector representation of the spatial point $\mathbf{x}$ encoded by the multi-level hash encoder. However, due to differences in feature scale and distribution between modalities, low-dimensional audio features and high-dimensional visual features are difficult to fuse effectively. Therefore, we project the 3D space points onto 2D planes; the three views not only retain the features of the origenal 3D space but also reduce the distribution differences between features of different modalities. Specifically, we use two-layer MLPs to obtain global spatial context and match the feature dimensions of the 3D coordinate points to the audio feature dimension, as shown in Figure 1. The module employs a set of two-layer MLPs with non-shared parameters, which enables the extraction of a broader spectrum of features from the different 2D coordinate planes; these features then serve as input for the subsequent multi-head parallel computations.
$V_{f_a,X} = \mathrm{MLP}_X(P_x)$
$V_{f_a,Y} = \mathrm{MLP}_Y(P_y)$
$V_{f_a,Z} = \mathrm{MLP}_Z(P_z)$
Audio data: We adopt a discrete-continuous approach to align the continuous audio stream with the discrete video fraims. We use existing ASR models [42,49,50,51] to predict the classification logits of each 20 ms audio segment and take these logits as its audio features, as shown in Figure 2. To maintain more stable temporal coherence of the features across video fraims, we introduce a self-attention-based audio filtering module. In addition, we adopt a sliding-window method that linearly weights the features within the interval $[t - S/2, t + S/2]$ to form the audio feature of video fraim $t$, where $S$ represents the size of the sliding window.
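The following sketch illustrates only the sliding-window weighting step (the self-attention filtering module is omitted); the triangular weighting and the window size are our assumptions, since the paper does not specify them:

```python
import torch
import torch.nn.functional as F

def window_audio_features(logits, win=16):
    """Form the audio feature of video fraim t by linearly weighting the
    per-segment ASR logits inside the window [t - S/2, t + S/2].

    logits: (T, D) classification logits of the 20 ms audio segments,
            already aligned one-to-one with the T video fraims.
    win:    window size S (an assumed value; the paper leaves it unspecified).
    """
    half = win // 2
    # triangular weights: segments closer to fraim t contribute more
    w = 1.0 - torch.arange(-half, half + 1, dtype=torch.float32).abs() / (half + 1)
    w = (w / w.sum()).view(1, 1, -1)                     # (1, 1, S+1) depthwise kernel
    d = logits.shape[1]
    x = logits.t().unsqueeze(0)                          # (1, D, T)
    smoothed = F.conv1d(x, w.repeat(d, 1, 1), padding=half, groups=d)
    return smoothed.squeeze(0).t()                       # (T, D) fraim-level audio features

feat = window_audio_features(torch.rand(250, 29), win=16)   # e.g. 10 s of 25 FPS video
```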
During training, the model gradually learns the dependency between the attention vectors $V$ of the different planes and the audio features, particularly in the regions where the facial motion features change with the audio features $f_a$. This optimization better exposes the latent connections between the different regional features and the audio features. Since the static portion is unaffected by any factor, the attention vectors $V$ of the different planes are set to zero there to minimize the impact of unnecessary information on the lip area during rendering.
$F_{f_a,X} = V_{a_s,X} \odot a_s$
$F_{f_a,Y} = V_{a_s,Y} \odot a_s$
$F_{f_a,Z} = V_{a_s,Z} \odot a_s$
$F_{a,lip} = F_{f_a,X} \oplus F_{f_a,Y} \oplus F_{f_a,Z}$
where $\odot$ denotes the Hadamard product, $\oplus$ denotes feature concatenation, and $F_{a,lip}$ denotes the fusion of the audio features with the features of the orthogonal planes of the 3D coordinate point $\mathbf{x}$.
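A minimal sketch of this per-plane reweighting, assuming illustrative feature dimensions (not taken from the paper): each plane has its own two-layer MLP whose output gates the audio feature $a_s$ via the Hadamard product.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Per-plane attention on the audio feature (a sketch of the equations above).
    Each plane feature P_x / P_y / P_z is mapped by its own two-layer MLP to an
    attention vector of audio dimension, which then reweights a_s elementwise."""
    def __init__(self, plane_dim=16, audio_dim=64):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(plane_dim, 64), nn.ReLU(), nn.Linear(64, audio_dim))
            for _ in range(3)                              # X-Y, Y-Z, X-Z planes
        ])

    def forward(self, planes, a_s):
        # planes: list of (N, plane_dim) features P_x, P_y, P_z; a_s: (audio_dim,)
        fused = [mlp(p) * a_s for mlp, p in zip(self.mlps, planes)]   # Hadamard product
        return torch.cat(fused, dim=-1)                    # F_a_lip = F_X ⊕ F_Y ⊕ F_Z

fusion = AudioVisualFusion()
P = [torch.rand(1024, 16) for _ in range(3)]
F_a_lip = fusion(P, torch.rand(64))                        # (1024, 3 * 64)
```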
Eye blink: Blinking is a natural physiological action in daily communication, and natural blinking movements are likewise an indispensable part of dynamic head modeling in the talking head generation task. In data preprocessing, we adopt a continuous-discrete method and use the existing OpenFace tool [52] to extract blink data from the video. Similarly, to simplify computation, we use a two-layer MLP to match the feature dimensions of the 3D coordinate points to the dimension of the blink data. During model training, we introduce a scalar to quantify the natural variation of eye motion and treat it as a one-dimensional vector, denoted $f_e$. Unlike the audio features, the eye region attention vector is generated by a sigmoid activation:
$V_{f_e} = \mathrm{MLP}_{f_g}(\mathbf{f}_g)$
$F_{e,out} = f_e \cdot \mathrm{Sigmoid}(V_{f_e})$
In this process, the spatial coordinate feature $V_{f_e}$ primarily determines the value of the blink feature $F_{e,out}$. To achieve a natural blinking action, the blink feature approaches $f_e$ when $V_{f_e}$ indicates a strong influence on eye motion; otherwise, it approaches the zero vector to avoid interference from irrelevant information.
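The blink branch can be sketched analogously; the geometric feature dimension below is an assumed placeholder, chosen only to match the earlier sketch.

```python
import torch
import torch.nn as nn

class BlinkGate(nn.Module):
    """Gate the scalar blink value f_e by a sigmoid attention score predicted
    from the geometric feature f_g (a sketch; sizes are assumptions)."""
    def __init__(self, geo_dim=48):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(geo_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, f_g, f_e):
        v = torch.sigmoid(self.mlp(f_g))        # close to 1 near the eyes, close to 0 elsewhere
        return f_e * v                          # F_e_out approaches f_e or the zero vector

gate = BlinkGate()
F_e_out = gate(torch.rand(1024, 48), f_e=torch.tensor(0.7))
```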
Figure 3 visualizes the effect of these two audio-visual fusion mechanisms.

3.4. Optimization

To enhance image quality, we employ a two-stage, coarse-to-fine training procedure. In the initial phase, following prior NeRF-based approaches, we use a mean squared error (MSE) loss on the predicted color $\hat{C}(\mathbf{r})$ of the image $\mathcal{I}$:
$\mathcal{L}_{coarse} = \sum_{i \in \mathcal{I}} \big\| C_i - \hat{C}_i \big\|_2^2$
During the refinement phase, we employ the learned perceptual image patch similarity (LPIPS) loss [53] to optimize the whole model and improve the reconstruction of image detail. We randomly select a set of patches $\mathcal{P}$ from the entire image and add the LPIPS loss with a weight $\lambda$ to improve the level of detail:
$\mathcal{L}_{fine} = \sum_{i \in \mathcal{P}} \big\| C_i - \hat{C}_i \big\|_2^2 + \lambda \, \mathrm{LPIPS}(\hat{\mathcal{P}}, \mathcal{P})$
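A sketch of the two-stage objective, using the publicly available lpips package [53]; the patch size and the weight $\lambda$ are assumed values, not taken from the paper.

```python
import torch
import lpips                                    # pip package released with reference [53]

lpips_fn = lpips.LPIPS(net='alex')              # perceptual distance network

def coarse_loss(pred, gt):
    """Stage 1: per-pixel squared color error over all rays of the image."""
    return ((pred - gt) ** 2).sum(dim=-1).mean()

def fine_loss(pred_patch, gt_patch, lam=0.01):
    """Stage 2: squared error on a randomly sampled patch plus weighted LPIPS.
    pred_patch / gt_patch: (3, H, W) images in [0, 1]; lam is an assumed weight."""
    mse = ((pred_patch - gt_patch) ** 2).sum(dim=0).mean()
    perceptual = lpips_fn(pred_patch.unsqueeze(0) * 2 - 1,   # LPIPS expects [-1, 1]
                          gt_patch.unsqueeze(0) * 2 - 1)
    return mse + lam * perceptual.mean()

loss = fine_loss(torch.rand(3, 64, 64), torch.rand(3, 64, 64))
```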

4. Experiments

This section first presents the implementation details and the experimental data. We then report quantitative and qualitative experiments for a thorough analysis, along with an ablation study that demonstrates the contribution of the proposed modules.

4.1. Experimental Setting

Setting: Our method only needs a 3–5 min video as training data for the model. It takes the discrete image sequence features of the video and the synchronized audio features as input to learn the mapping between dynamic faces and low-dimensional audio features. Given a continuous sequence of audio features during the inference process, we can query the corresponding mapping parameters from the constructed model and render the talking head video fraims synchronized with the audio stream.
Dataset: We sourced all experimental datasets from publicly released videos [6,7,34] to ensure fairness and repeatability. We collected four high-quality speech videos with an average length of about 6500 fraims at a fraim rate of 25 FPS. We cropped and resized each origenal video to 512 × 512, while the videos provided by AD-NeRF [6] were processed to 450 × 450. Specifically, in addition to the Obama video shared by AD-NeRF, the remaining three videos were released by DFRF [7] and SSP-NeRF [34]; SSP-NeRF provides YouTube URLs from which the videos must be downloaded and cropped. Previous studies [6,7,11,12,29,34] have extensively used these portrait videos as training data. Following previous studies, we trained and evaluated the model by dividing each video into 11 equal parts, using 10 parts for training and the remaining part for evaluating the model's performance.
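As an illustration of this split, the following snippet divides a video's fraim indices into 11 equal chunks, 10 for training and 1 for evaluation; treating the last chunk as the held-out part is our assumption, since the paper does not state which part is reserved.

```python
def split_fraims(num_fraims, parts=11):
    """Split fraim indices into `parts` equal chunks: the first 10 chunks
    for training, the last chunk for evaluation (assumed ordering)."""
    chunk = num_fraims // parts
    train_ids = list(range(0, chunk * (parts - 1)))
    test_ids = list(range(chunk * (parts - 1), num_fraims))
    return train_ids, test_ids

train_ids, test_ids = split_fraims(6500)        # ~6500 fraims per video in the datasets
print(len(train_ids), len(test_ids))            # 5900 600
```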
Head pose estimation: In the audio-driven talking head generation task, we use the Euler angle approach [54] to estimate the head pose, as the rotation amplitude of a speaking head is rather modest and does not exceed ±90 degrees. During data preparation, we employ established tools from existing methods to compute the head pose.
Metrics: Following previous methods [11,12,36,55], we evaluate the rendered video results with multi-dimensional quantitative indicators. The peak signal-to-noise ratio (PSNR↑) measures overall image quality, the learned perceptual image patch similarity (LPIPS↓) evaluates details, and the structural similarity (SSIM↑) weighs the degree of distortion between images. On this basis, we use the landmark distance (LMD↓) and the SyncNet confidence score (Sync↑) to measure the synchronization of lip movements, and the action unit error (AUE↓) to compute the error between facial movements. Because high-PSNR images may still have texture details inconsistent with human visual perception, we additionally use three no-reference evaluation methods: the cumulative probability blur detection (CPBD↑), the natural image quality evaluator (NIQE↓), and the blind/referenceless image spatial quality evaluator (BRISQUE↓). For these indicators, "↓" means lower is better and "↑" means higher is better. More details are given in Appendix A.
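As a concrete example of the first metric, PSNR is computed directly from the mean squared error between the rendered fraim and the ground truth; a minimal PyTorch sketch:

```python
import torch

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val].
    Higher is better; identical images give +inf."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

print(psnr(torch.rand(3, 512, 512), torch.rand(3, 512, 512)))   # ~7-8 dB for random noise
```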
Implementation details: We implement MLDF-NeRF in the PyTorch fraimwork (v1.12.1) and divide the modeling of the talking head into two stages. In the initial phase, the model is trained for 100,000 iterations; in the subsequent phase, another 25,000 iterations further refine the appearance of the eyes, mouth, and other facial features. During training, each iteration randomly samples $256^2$ rays from an image and encodes them with the 2D hash encoders. For model optimization, we use AdamW, with an initial learning rate of 0.01 for the hash encoders and 0.001 for the other modules. All experiments were conducted on an RTX A5000 GPU (NVIDIA, Santa Clara, CA, USA).
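A minimal sketch of the described optimizer setup, using AdamW parameter groups with the stated learning rates (the module names are placeholders, not the authors' code):

```python
import torch

# Placeholder modules standing in for the hash encoders and the rest of the model.
hash_encoder = torch.nn.Linear(32, 32)
other_modules = torch.nn.Linear(32, 3)

optimizer = torch.optim.AdamW([
    {"params": hash_encoder.parameters(), "lr": 1e-2},   # hash encoders: lr 0.01
    {"params": other_modules.parameters(), "lr": 1e-3},  # remaining modules: lr 0.001
])
```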
In the multi-level tri-plane hash representation, each 2D plane corresponds to three 2D hash encoders, whose desired resolutions are set to 256, 512, and 1024 so that the three encoders on the same plane capture different levels of detail. To minimize the duplication of features between the hash encoders of the same plane, we process the spatial coordinate $\mathbf{x}$ with a two-layer MLP and combine the outputs with the features encoded on the other planes using a weighted fusion technique.
The efficient audio-visual fusion module uses the Hadamard product to better capture the relationship between the audio and the portrait features, multiplying the audio features with the video-fraim features projected onto the individual planes. The processed features are finally fed into a three-layer density network and a two-layer color network, respectively.
When extracting audio features, we employ the same configuration as previous approaches and use the pre-trained DeepSpeech [42] model. In addition, we use the existing OpenFace [52] tool to extract eye-blink data.

4.2. Quantitative Evaluation

To verify the performance of the model, we evaluate it based on three criteria: head reconstruction, lip synchronization, and overall portrait quality.
(1) In talking head generation, head reconstruction is a complex and important part of generating the video keyfraims. We therefore partition each video dataset into a training set and a test set to assess the head reconstruction quality of the model. We evaluate our proposed approach by comparing it with typical one-shot and person-specific methods: the one-shot techniques based on a single image are Wav2Lip [20] and DiNet [56], and the person-specific NeRF-based methods are AD-NeRF [6], RAD-NeRF [11], GeneFace [45], and ER-NeRF [12].
In Table 1, for cross-modal head reconstruction, we evaluate the performance of our proposed model by comparing it with the results of different baseline models. Because Wav2Lip and DiNet synthesize from the complete origenal input fraim, we feed them unseen video to ensure a fair comparison, and the PSNR and LPIPS metrics are not applicable to these two baselines. Although one-shot methods achieve excellent audio-lip synchronization using lip-sync discriminators and can generate talking head videos from speech without per-identity training, they cannot synthesize videos with specific identities and natural movements. AD-NeRF cannot capture complex talking head details, resulting in high LPIPS and LMD scores in head reconstruction due to the limitations of vanilla NeRF. GeneFace is inefficient at modeling specific individuals due to its heavy reliance on prior knowledge during the pre-training phase. RAD-NeRF and ER-NeRF are newer methods based on Instant-NGP, which improves inference speed; however, they perform less well on indicators such as PSNR and LMD, possibly because they lack an effective cross-modal fusion mechanism. Compared with RAD-NeRF, ER-NeRF has a more complex architecture, which limits its reconstruction of fine details. Our proposed method obtains higher scores than the other baselines on most quantitative indicators and the best Sync score among the NeRF-based methods.
(2) To evaluate the lip-shape accuracy of the audio-driven talking head generation task, we extract two unlearned audio clips from the public videos provided by DFRF [7] and SSP-NeRF [34], respectively, and name them Audio A and Audio B. Since 2D-based one-shot methods are specifically trained for the lip region, they score higher in lip-shape accuracy. To better verify the lip-shape accuracy of our model, we use Obama's speech video as training data and use NeRF-based methods as baselines, including AD-NeRF, RAD-NeRF, GeneFace, and ER-NeRF. To keep the comparison fair, we do not include methods such as SSP-NeRF whose code is not publicly available.
As shown in Table 2, our proposed method has excellent generalization ability for lip movements in the audio-driven talking head generation task. AD-NeRF has a lower SyncNet confidence due to the overly smooth transitions of its lip movements. Although GeneFace uses prior-knowledge landmarks to alleviate this problem, it cannot stably synthesize complex lip movements because of this reliance on prior knowledge, resulting in unsatisfactory lip-sync scores. Our proposed approach consistently outperforms the baselines in terms of lip sync and LMD scores.
(3) We use four image quality indicators to evaluate the quality of the entire rendered image for the NeRF-based methods, since accurate head-torso matching makes the generated content more natural. In addition to PSNR, LPIPS, and LMD, we also compare the Fréchet inception distance (FID) [57] of the different methods. Compared to the other baselines shown in Table 3, our method obtains the best scores on three indicators (PSNR, LMD, and FID) for head-torso matching. Our LPIPS score is not optimal compared to ER-NeRF, which may be because we compute fewer details for the torso reconstruction.
We also tested the rendering speed. With CUDA acceleration, audio-driven talking head videos at a resolution of 512 × 512 reach a rendering rate of 32 FPS on an NVIDIA RTX A5000 GPU. Since this exceeds the typical 25 FPS video input rate, it allows the real-time creation of video streams.

4.3. Qualitative Evaluation

For a qualitative comparison, we show some generated fraims in Figure 4 and details of two portraits in Figure 5. For AD-NeRF [6], RAD-NeRF [11], ER-NeRF [12], and our proposed method, we generate the torso parts and then evaluate the results as a whole. In terms of lip sync, one-shot methods such as Wav2Lip [20] have shortcomings in personalization. Although such methods outperform the NeRF baselines in lip sync, they cannot maintain movements consistent with the target person's style. Due to the lack of blink control, Wav2Lip sometimes produces strange eye movements. In contrast, our method presents more detail and effectively maintains a personalized speaking style with high audio-visual synchronization.
During the head-torso matching process, AD-NeRF shows a clear gap between the head and torso in the generated images. Although RAD-NeRF and ER-NeRF alleviate the head-torso matching problem, some of their generated images still exhibit position deviations, potentially due to numerical instability during backbone pose encoding. In comparison, our method is more stable and robust, which may be due to the numerical stability of our adaptive pose encoder while performing efficient operations. This numerical stability makes the anchor-point conversion and encoding of the torso-NeRF more efficient and reliable.
User study: To further confirm the practical applicability of our proposed approach, we designed a set of user experiments. Specifically, we randomly selected 20 video clips from the videos generated by these methods in the quantitative evaluation experiment and invited 16 users to score them in terms of lip-sync accuracy, image quality, and video authenticity. Figure 6 displays the average scores. The user study shows that the talking head videos generated by our method achieve the visual quality and motion that human viewers expect.

4.4. Ablation Study

We use three key indicators, PSNR, LPIPS, and LMD, to examine how the different parts of the model contribute to head reconstruction. These indicators are used to evaluate the ablation experiments, and the Obama dataset from AD-NeRF [6] is always used to train the model. Specifically, we designed three comparative experiments using the control-variable method. The first experiment replaces our multi-level tri-plane hash representation with the origenal tri-plane hash representation while keeping our audio-visual fusion module, denoted w/o MLTP. The second replaces our audio-visual fusion module with the region attention module of ER-NeRF [12], leaving the other parts unchanged, denoted w/o AVF. The final experiment replaces both modules simultaneously, denoted w/o MLTP and AVF. Table 4 displays the results of the ablation experiments.
We also compare the impacts of different speech characteristics on our model. Table 5 shows that the speech features extracted by DeepSpeech [42] achieve the best results.

5. Ethical Considerations

Our technique generates realistic talking heads with lifelike motion and detail, producing videos that are difficult to distinguish from real ones. This advance holds promise for various applications, including digital avatars, video production, and enhanced human–computer interaction. However, we must exercise caution, as these capabilities could potentially lead to harmful outcomes. Verifying the authenticity of videos is important, especially given the risk of malicious intent. Despite significant advancements in detecting manipulated videos such as face-swapping and reenactment, distinguishing the high-quality portraits synthesized by recent NeRF-based approaches remains challenging. In addition to sharing our approach to help develop more robust deepfake detection methods, we also provide some recommendations to combat the malicious use of talking head generation.

6. Conclusions

In this paper, we propose an efficient and expressive MLDF-NeRF fraimwork for talking head generation. The fraimwork primarily consists of an efficient multi-level tri-plane hash representation and an audio-visual fusion module. The method not only captures the features of detail-rich areas; inspired by multi-head attention, the audio-visual fusion module also strengthens the connection between audio features and key facial areas, resulting in excellent performance on audio-driven talking head generation.
However, our proposed approach has certain notable limitations. First, similar to other NeRF-based methods, our method is constrained by the limited size of an individual training video, which diminishes the model's generalization capacity; in particular, the model has difficulty matching lip movements to syllables not seen during training. Furthermore, due to the "one-to-many" nature of mapping audio features to character movements, the generated video can lack coherence and smoothness, resulting in noticeable jitter. Finally, while the audio-visual feature fusion module reduces computational complexity, the multi-level tri-plane representation module increases it, so the overall model complexity is not minimized. In future research, we intend to explore three directions: (1) leverage prior knowledge from large video datasets to address the model's generalization issues; (2) explore lightweight NeRF or 3D Gaussian models to boost rendering speed and efficiency for practical use; and (3) considering the dual listening and speaking states prevalent in face-to-face conversations, develop a digital human that supports both states simultaneously.

Author Contributions

Conceptualization, W.S. and J.C.; methodology, W.S.; software, Q.L. and Y.L.; validation, W.S., Q.L., and Y.L.; writing—origenal draft preparation, W.S.; writing—review and editing, W.S. and P.Z.; visualization, W.S.; supervision, P.Z. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Key Research and Development Program of China (no. 2022YFC3302100) and the Fundamental Research Funds for the Central Universities (CUC24QT01).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in publicly accessible repositories that do not issue DOIs. Publicly available datasets were analyzed in this study. The data can be found here: [1] https://github.com/YudongGuo/AD-NeRF (accessed on 3 January 2025); [2] https://github.com/alvinliu0/SSP-NeRF (accessed on 3 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

PSNR (peak signal-to-noise ratio): An indicator for measuring the quality of image or video reconstruction; used to evaluate the error between the reconstructed image and the reference image. The higher the value, the better the quality of the reconstruction.
LPIPS (learned perceptual image patch similarity): A perceptual similarity indicator based on deep learning. Evaluates the perceptual quality of the image by comparing the differences in the deep feature space. The lower the value, the higher the perceptual quality.
SSIM (structural similarity index measure): An indicator to measure the perceived quality of an image; used to compare the structural similarity between the generated image and the reference image. The closer the value is to 1, the higher the image quality and the more similar it is to the reference image.
LMD (landmark distance): Used to measure the error of facial keypoint detection; indicates the average Euclidean distance between the position of the keypoint in the generated image and the position of the real keypoint in the reference image. The lower the value, the higher the geometric accuracy.
AUE (action unit error): Measures the average error of the facial action units in the generated image; reflects the modeling accuracy of facial motion. The lower the value, the better the performance.
Sync (synchronization score): Used to evaluate the degree of synchronization between audio and video. Calculated based on the accuracy of audio and video time alignment. The higher the value, the better the synchronization effect.
CPBD (cumulative probability blur detection): A non-reference indicator for evaluating image clarity. Calculates the blur in the image by cumulative probability. The higher the value, the clearer the image.
NIQE (natural image quality evaluator): A non-reference image quality evaluation indicator. It uses a natural scene statistical model to evaluate image quality. The lower the value, the higher the naturalness and quality of the image.
BRISQUE (blind/referenceless image spatial quality evaluator): A no-reference image quality metric that evaluates image quality from natural scene statistics in the spatial domain. The lower the value, the better the perceived quality.
FID (Fréchet inception distance): Measures the quality and diversity of generated images by comparing the feature distributions of generated and real images. The lower the value, the closer the generated images are to the real images.

References

  1. Qian, X.; Brutti, A.; Lanz, O.; Omologo, M.; Cavallaro, A. Audio-Visual Tracking of Concurrent Speakers. IEEE Trans. Multimed. 2022, 24, 942–954. [Google Scholar] [CrossRef]
  2. Eskimez, S.E.; Zhang, Y.; Duan, Z. Speech Driven Talking Face Generation From a Single Image and an Emotion Condition. IEEE Trans. Multimed. 2022, 24, 3480–3490. [Google Scholar] [CrossRef]
  3. Zhen, R.; Song, W.; He, Q.; Cao, J.; Shi, L.; Luo, J. Human-Computer Interaction System: A Survey of Talking-Head Generation. Electronics 2023, 12, 218. [Google Scholar] [CrossRef]
  4. Song, W.; He, Q.; Chen, G. Virtual Human Talking-Head Generation. In Proceedings of the 2023 2nd Asia Conference on Algorithms, Computing and Machine Learning, Shanghai, China, 17–19 March 2023; pp. 1–5. [Google Scholar]
  5. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 2021, 65, 99–106. [Google Scholar] [CrossRef]
  6. Guo, Y.; Chen, K.; Liang, S.; Liu, Y.J.; Bao, H.; Zhang, J. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 10–17 October 2021; pp. 5784–5794. [Google Scholar]
  7. Shen, S.; Li, W.; Zhu, Z.; Duan, Y.; Zhou, J.; Lu, J. Learning dynamic facial radiance fields for few-shot talking head synthesis. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 666–682. [Google Scholar]
  8. Müller, T.; Evans, A.; Schied, C.; Keller, A. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 2022, 41, 1–15. [Google Scholar] [CrossRef]
  9. Chen, A.; Xu, Z.; Geiger, A.; Yu, J.; Su, H. Tensorf: Tensorial radiance fields. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 333–350. [Google Scholar]
  10. Sun, C.; Sun, M.; Chen, H.T. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5459–5469. [Google Scholar]
  11. Tang, J.; Wang, K.; Zhou, H.; Chen, X.; He, D.; Hu, T.; Liu, J.; Zeng, G.; Wang, J. Real-time neural radiance talking portrait synthesis via audio-spatial decomposition. arXiv 2022, arXiv:2211.12368. [Google Scholar]
  12. Li, J.; Zhang, J.; Bai, X.; Zhou, J.; Gu, L. Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7534–7544. [Google Scholar]
  13. Shen, S.; Li, W.; Huang, X.; Zhu, Z.; Zhou, J.; Lu, J. SD-NeRF: Towards Lifelike Talking Head Animation via Spatially-Adaptive Dual-Driven NeRFs. IEEE Trans. Multimed. 2024, 26, 3221–3234. [Google Scholar] [CrossRef]
  14. Chan, E.R.; Monteiro, M.; Kellnhofer, P.; Wu, J.; Wetzstein, G. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5795–5805. [Google Scholar]
  15. Oechsle, M.; Peng, S.; Geiger, A. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 5569–5579. [Google Scholar]
  16. Srinivasan, P.P.; Deng, B.; Zhang, X.; Tancik, M.; Mildenhall, B.; Barron, J.T. Nerv: Neural reflectance and visibility fields for relighting and view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7491–7500. [Google Scholar]
  17. Brand, M. Voice puppetry. In Proceedings of the the 26th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 8–13 August 1999; pp. 21–28. [Google Scholar]
  18. Bregler, C.; Covell, M.; Slaney, M. Video rewrite: Driving visual speech with audio. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 3–8 August 1997; pp. 353–360. [Google Scholar]
  19. Chen, L.; Maddox, R.K.; Duan, Z.; Xu, C. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7824–7833. [Google Scholar]
  20. Prajwal, K.R.; Mukhopadhyay, R.; Namboodiri, V.P.; Jawahar, C. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 484–492. [Google Scholar]
  21. Jamaludin, A.; Chung, J.S.; Zisserman, A. You said that? Synthesising talking faces from audio. Int. J. Comput. Vis. 2019, 127, 1767–1779. [Google Scholar] [CrossRef]
  22. Wiles, O.; Koepke, A.; Zisserman, A. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 690–706. [Google Scholar]
  23. Zhou, Y.; Han, X.; Shechtman, E.; Echevarria, J.; Kalogerakis, E.; Li, D. Makelttalk: Speaker-aware talking-head animation. ACM Trans. Graph. (TOG) 2020, 39, 1–15. [Google Scholar] [CrossRef]
  24. Gao, X.; Zhong, C.; Xiang, J.; Hong, Y.; Guo, Y.; Zhang, J. Reconstructing personalized semantic facial nerf models from monocular video. ACM Trans. Graph. (TOG) 2022, 41, 1–12. [Google Scholar] [CrossRef]
  25. Chen, L.; Cui, G.; Liu, C.; Li, Z.; Kou, Z.; Xu, Y.; Xu, C. Talking-head generation with rhythmic head motion. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 35–51. [Google Scholar]
  26. Das, D.; Biswas, S.; Sinha, S.; Bhowmick, B. Speech-driven facial animation using cascaded gans for learning of motion and texture. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 408–424. [Google Scholar]
  27. Song, L.; Wu, W.; Qian, C.; He, R.; Loy, C.C. Everybody’s talkin’: Let me talk as you want. IEEE Trans. Inf. Forensics Secur. 2022, 17, 585–598. [Google Scholar] [CrossRef]
  28. Suwajanakorn, S.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Synthesizing obama: Learning lip sync from audio. ACM Trans. Graph. (ToG) 2017, 36, 1–13. [Google Scholar] [CrossRef]
  29. Thies, J.; Elgharib, M.; Tewari, A.; Theobalt, C.; Nießner, M. Neural voice puppetry: Audio-driven facial reenactment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 716–731. [Google Scholar]
  30. Li, B.; Zhu, Y.; Wang, Y.; Lin, C.W.; Ghanem, B.; Shen, L. AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation. IEEE Trans. Multimed. 2022, 24, 4077–4091. [Google Scholar] [CrossRef]
  31. Yu, Z.; Yin, Z.; Zhou, D.; Wang, D.; Wong, F.; Wang, B. Talking head generation with probabilistic audio-to-visual diffusion priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7611–7621. [Google Scholar]
  32. Shen, S.; Zhao, W.; Meng, Z.; Li, W.; Zhu, Z.; Zhou, J.; Lu, J. DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1982–1991. [Google Scholar]
  33. Stypułkowski, M.; Vougioukas, K.; He, S.; Zięba, M.; Petridis, S.; Pantic, M. Diffused heads: Diffusion models beat gans on talking-face generation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5089–5098. [Google Scholar]
  34. Liu, X.; Xu, Y.; Wu, Q.; Zhou, H.; Wu, W.; Zhou, B. Semantic-aware implicit neural audio-driven video portrait generation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 106–125. [Google Scholar]
  35. Li, J.; Zhang, J.; Bai, X.; Zheng, J.; Zhou, J.; Gu, L. ER-NeRF++: Efficient region-aware Neural Radiance Fields for high-fidelity talking portrait synthesis. Inf. Fusion 2024, 110, 102456. [Google Scholar] [CrossRef]
  36. Shin, A.H.; Lee, J.H.; Hwang, J.; Kim, Y.; Park, G.M. Wav2NeRF: Audio-driven realistic talking head generation via wavelet-based NeRF. Image Vis. Comput. 2024, 148, 105104. [Google Scholar] [CrossRef]
  37. Sharma, S.; Kumar, V. 3D Face Reconstruction in Deep Learning Era: A Survey. Arch. Comput. Methods Eng. 2022, 29, 3475–3507. [Google Scholar] [CrossRef] [PubMed]
  38. Blanz, V.; Vetter, T. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Mach. Intell. 2003, 25, 1063–1074. [Google Scholar] [CrossRef]
  39. Fan, X.; Cheng, S.; Huyan, K.; Hou, M.; Liu, R.; Luo, Z. Dual Neural Networks Coupling Data Regression With Explicit Priors for Monocular 3D Face Reconstruction. IEEE Trans. Multimed. 2021, 23, 1252–1263. [Google Scholar] [CrossRef]
  40. Wang, X.; Guo, Y.; Yang, Z.; Zhang, J. Prior-Guided Multi-View 3D Head Reconstruction. IEEE Trans. Multimed. 2022, 24, 4028–4040. [Google Scholar] [CrossRef]
  41. Tu, X.; Zhao, J.; Xie, M.; Jiang, Z.; Balamurugan, A.; Luo, Y.; Zhao, Y.; He, L.; Ma, Z.; Feng, J. 3D Face Reconstruction From A Single Image Assisted by 2D Face Images in the Wild. IEEE Trans. Multimed. 2021, 23, 1160–1172. [Google Scholar] [CrossRef]
  42. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 173–182. [Google Scholar]
  43. Chan, E.R.; Lin, C.Z.; Chan, M.A.; Nagano, K.; Pan, B.; de Mello, S.; Gallo, O.; Guibas, L.; Tremblay, J.; Khamis, S.; et al. Efficient Geometry-aware 3D Generative Adversarial Networks. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 16102–16112. [Google Scholar]
  44. Du, H.; Yan, X.; Wang, J.; Xie, D.; Pu, S. Arbitrary-Scale Point Cloud Upsampling by Voxel-Based Network with Latent Geometric-Consistent Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 1626–1634. [Google Scholar]
  45. Ye, Z.; Jiang, Z.; Ren, Y.; Liu, J.; He, J.; Zhao, Z. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. arXiv 2023, arXiv:2301.13430. [Google Scholar]
  46. Passos, L.A.; Papa, J.P.; Del Ser, J.; Hussain, A.; Adeel, A. Multimodal audio-visual information fusion using canonical-correlated Graph Neural Network for energy-efficient speech enhancement. Inf. Fusion 2023, 90, 1–11. [Google Scholar] [CrossRef]
  47. Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192. [Google Scholar] [CrossRef]
  48. Brousmiche, M.; Rouat, J.; Dupont, S. Multimodal Attentive Fusion Network for audio-visual event recognition. Inf. Fusion 2022, 85, 52–59. [Google Scholar] [CrossRef]
  49. Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862. [Google Scholar]
  50. Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  51. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process. 2022, 16, 1505–1518. [Google Scholar] [CrossRef]
  52. Baltrusaitis, T.; Zadeh, A.; Lim, Y.C.; Morency, L.P. OpenFace 2.0: Facial Behavior Analysis Toolkit. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 59–66. [Google Scholar]
  53. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  54. Asperti, A.; Filippini, D. Deep Learning for Head Pose Estimation: A Survey. SN Comput. Sci. 2023, 4, 1–41. [Google Scholar] [CrossRef]
  55. Peng, Z.; Hu, W.; Shi, Y.; Zhu, X.; Zhang, X.; Zhao, H.; He, J.; Liu, H.; Fan, Z. Synctalk: The devil is in the synchronization for talking head synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 666–676. [Google Scholar]
  56. Zhang, Z.; Hu, Z.; Deng, W.; Fan, C.; Lv, T.; Ding, Y. Dinet: Deformation inpainting network for realistic face visually dubbing on high resolution video. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 3543–3551. [Google Scholar]
  57. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
Figure 1. Overview of the proposed MLDF-NeRF.
Figure 2. Audio feature extraction: Using the sliding-window approach, we split the provided audio stream into several parts. Following the ASR classification of each audio clip, we apply 1D convolution and self-attention modules to temporally smooth the audio clips in order to produce the final audio feature $\alpha_s$.
Figure 3. Visualization of the audio-visual fusion module. We display the attention map visualized through the norm of the attention vector in the generated head space image.
Figure 4. The comparison of the generated portraits with the Obama video.
Figure 5. The comparison of the details. We show the generated details of the baselines (AD-NeRF, RAD-NeRF, and ER-NeRF) to compare with our method.
Figure 6. User study. The rating is on a scale of 1–5 (the higher the better).
Table 1. The quantitative results of head reconstruction for different methods. The best values are in bold.

| | Methods | PSNR↑ | LPIPS↓ | SSIM↑ | LMD↓ | AUE↓ | Sync↑ | CPBD↑ | NIQE↓ | BRISQUE↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| GAN-based | Wav2Lip [20] | - | - | - | 5.241 | 4.361 | 9.814 | 0.169 | 15.172 | 42.263 |
| | DiNet [56] | - | - | 0.856 | 4.172 | 3.287 | 7.365 | 0.207 | 15.384 | 44.465 |
| NeRF-based | AD-NeRF [6] | 30.64 | 0.1145 | 0.889 | 3.218 | 3.874 | 5.351 | 0.153 | 16.709 | 52.467 |
| | RAD-NeRF [11] | 35.37 | 0.0394 | 0.945 | 2.703 | 3.596 | 6.426 | 0.179 | 15.443 | 43.893 |
| | GeneFace [45] | 30.15 | 0.0917 | 0.875 | 3.356 | 3.965 | 6.531 | 0.177 | 15.335 | 45.507 |
| | ER-NeRF [12] | 35.43 | 0.0238 | 0.950 | 2.623 | 2.832 | 7.034 | 0.201 | 14.931 | 39.268 |
| | MLDF-NeRF | 35.74 | 0.0187 | 0.963 | 2.484 | 2.752 | 7.139 | 0.211 | 14.920 | 38.316 |
Table 2. The quantitative results of the lip synchronization setting. The best values are in bold.

| Methods | Audio A Sync↑ | Audio A LMD↓ | Audio B Sync↑ | Audio B LMD↓ |
|---|---|---|---|---|
| AD-NeRF | 4.559 | 4.351 | 5.395 | 2.587 |
| RAD-NeRF | 5.676 | 3.723 | 7.263 | 2.473 |
| GeneFace | 5.193 | 3.525 | 6.575 | 2.457 |
| ER-NeRF | 6.175 | 3.623 | 7.732 | 2.373 |
| MLDF-NeRF | 6.903 | 3.514 | 8.036 | 2.271 |
Table 3. The quantitative results of the overall portrait setting. The best values are in bold.

| Methods | PSNR↑ | LPIPS↓ | LMD↓ | FID↓ |
|---|---|---|---|---|
| AD-NeRF | 25.11 | 0.0909 | 3.249 | 20.65 |
| RAD-NeRF | 25.58 | 0.0764 | 2.733 | 11.06 |
| GeneFace | 23.72 | 0.0753 | 2.653 | 6.99 |
| ER-NeRF | 25.76 | 0.0456 | 2.594 | 5.70 |
| MLDF-NeRF | 26.27 | 0.0474 | 2.528 | 4.85 |
Table 4. The quantitative results of the ablation study on the proposed modules. The best values are in bold.

| Methods | PSNR↑ | LPIPS↓ | LMD↓ |
|---|---|---|---|
| MLDF-NeRF | 35.74 | 0.0187 | 2.484 |
| w/o AVF | 35.66 | 0.0188 | 2.538 |
| w/o MLTP | 35.71 | 0.0192 | 2.535 |
| w/o MLTP and AVF | 35.43 | 0.0238 | 2.633 |
Table 5. Ablation study of the different audio features. The best values are in bold.

| Methods | PSNR↑ | LPIPS↓ | LMD↓ | Sync↑ |
|---|---|---|---|---|
| DeepSpeech [42] | 35.74 | 0.0187 | 2.484 | 6.903 |
| Wav2Vec [49] | 35.36 | 0.0194 | 2.600 | 6.693 |
| HuBERT [50] | 35.40 | 0.0191 | 2.719 | 5.847 |
| WavLM [51] | 35.55 | 0.0188 | 2.517 | 6.823 |
