CITATION: Bigioi D and Corcoran P (2023). Multilingual video dubbing—a technology review and current challenges. Front. Sig. Proc. 3:1230755. doi: 10.3389/frsip.2023.1230755

KEYWORDS: talking head generation, dubbing, deep fakes, deep learning, artificial intelligence, video synthesis, audio video synchronisation
When content is professionally dubbed, a voice actor will carefully work to align the translated text with the original actor's facial movements and expressions. This is a challenging and skilled task, and it is difficult to find multi-lingual voice actors, so often only the lead actors in a movie will be professionally overdubbed. This creates an "uncanny valley" effect for most overdubbed content which detracts from the viewing experience, and it is often preferable to view content in the original language with subtitles. Thus the overdubbing of digital content remains a significant challenge for the video streaming industry (Spiteri Miggiani, 2021).

For the best quality of experience in viewing multi-lingual content it is desirable not only to overdub the speech track for a character, but also to adjust their facial expressions, particularly the lip and jaw movements, to match the speech dubbing. This requires a subtle adjustment of the original video content for each available language track, ensuring that while the lip and jaw movements change in response to the new language track, the overall performance of the original language actor is not diminished in any way. But achieving this seamless audio driven automatic dubbing is a non-trivial task, with many approaches proposed over the last half-decade tackling this problem. Deep learning techniques especially have proven popular in this domain (Yang et al., 2020; Vougioukas et al., 2020; Thies et al., 2020; Song et al., 2018; Wen et al., 2020), demonstrating compelling results on the tasks of automatic dubbing, and the less constrained, more well-known task of "talking head generation."

In this article, current state-of-the-art approaches are discussed with reference to the most recent and relevant works in automatic dubbing and the closely related field of talking head generation. A taxonomy of papers within both fields is presented, and the current SotA for both audio-driven automatic dubbing and talking head generation is discussed and outlined. Recent approaches can be broadly classified as falling within two main schools of thought: end-to-end, or structural-based generation (Liang et al., 2022). It is clear from this review that much of the foundation technology is now available to tackle photo-realistic multilingual dubbing, but there are still remaining challenges which we seek to define and clarify in our concluding discussion.

2 The high-level dubbing pipeline

Traditionally, dubbing is a costly post-production affair that consists of three primary steps:

• Translation: This is the process of taking the script of the original video, and translating it to the desired language(s). Traditionally, this is done by hiring multiple language experts, fluent in both the original and target languages. With the emergence of large language models in recent years, however, accurate automatic language to language translation is becoming a reality (Duquenne et al., 2023), and has been adopted into industry use as early as 2020 by the likes of Netflix (Alarcon, 2023). That being said, the models are not perfect and are susceptible to mistranslations, therefore to ensure quality an expert is still required to look over the translated script.
• Voice Acting: Once the scripts have been translated, the next step is to identify and hire suitable voice actors for each of the desired languages. For a high quality dub, care must be taken to ensure that the voice actors can accurately portray the range of emotions of the original recording, and that their voices suitably match the on-screen character. This is a costly and time-consuming endeavour, and would benefit immensely from automation. Despite incredible advances in text-to-speech and voice-cloning technologies in recent years, a lot of work still remains to be able to truly replicate the skill of a professional voice actor (Weitzman, 2023); however, for projects where quality is not as important, text to speech is an attractive option due to its reduced cost.
• Audio Visual Mixing: As soon as the new language voice recordings are obtained, the final step is to combine them with the original video recording in as seamless a manner as possible. Traditionally this involves extensive manual editing work in order to properly align and synchronise the new audio to the original video performance. Even the most skilled of editors, however, cannot truly synchronise these two streams. High quality dubbing work is enjoyable to watch, yet oftentimes it is still noticeable that the content is dubbed. Poor quality dubbing work detracts from the user experience, oftentimes inducing the "uncanny-valley" effect in viewers.

Due to the recent advancements in deep learning, there is scope for automation in each of the traditional dubbing steps. Manual language translation can be carried out automatically by large language models such as Duquenne et al. (2023). Traditional voice acting can be replaced by powerful text to speech models such as Łańcucki (2021); Liu et al. (2023); Wang et al. (2017). Audio-visual mixing can then be carried out by talking head generation/video editing models such as Zhou et al. (2020). Given the original video and language streams, the following is an example of what such an automatic dubbing pipeline might look like for dubbing an English language video into German:

• Transcribing and Translating Source Audio: Using an off-the-shelf automatic speech recognition model, an accurate transcript can be produced from the speech audio. The English transcript can then be translated into German using a large language model such as BERT or GPT-3 finetuned on the language to language translation task.
• Synthesizing Audio: Synthetic speech can be produced by leveraging a text to speech model, taking the translated transcript as input, and outputting realistic speech. Ideally the model would be finetuned on the original actor's voice, and produce high quality speech that sounds just like the original actor but in a different language.
• 3D Character Face Extraction: From the video stream, detect and isolate the target character. Map the target character's face onto a 3D morphable model using monocular 3D reconstruction, and isolate the headpose/global head movement, obtaining a static 3D face. Remove the original lip/jaw movements, but retain the overall facial expressions and eye blinks on the character model.
• Facial Animation Generation: Generate the expression parameters corresponding to the lip and jaw movements on the 3D face model in response to the driving synthetic German audio speech signal via a recurrent neural network. Introduce the global head movement information back to the 3D model to obtain a 3D head whose facial expressions and head pose correspond to the original performance, but with the lips and jaws modified in response to the new audio.
• Rendering: Mask out the facial region of the character in the original video, insert the newly generated 3D face model on top, and utilise an image-to-image translation network to generate the final photorealistic output frames.
FIGURE 1
A high-level diagram depicting the automatic dubbing process described in this section. The 3D model image is taken from the work of Cudeiro et al. (2019), while the subject displayed is part of the Cooke et al. (2006) dataset.

The hypothetical pipeline described above is known as a structural-based approach, and is illustrated in Figure 1. The next section shall go into more detail on popular structural-based approaches, as well as end-to-end methods for talking head generation, audio driven automatic dubbing/audio driven video editing.
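To make this flow concrete, the following is a minimal orchestration sketch of such a pipeline. It is purely illustrative: every function (transcribe, translate, synthesise_speech, reconstruct_3d, predict_expressions, neural_render) is a hypothetical placeholder standing in for an off-the-shelf or custom-trained component, not an existing API.

```python
# Illustrative sketch of the structural dubbing pipeline described above.
# All functions are hypothetical placeholders for real components (ASR,
# translation LLM, TTS, monocular 3D reconstruction, neural renderer).
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class FaceTrack:
    pose: List[Any] = field(default_factory=list)        # per-frame global head pose
    geometry: List[Any] = field(default_factory=list)     # identity/shape coefficients
    expression: List[Any] = field(default_factory=list)   # lip/jaw and expression coefficients

def transcribe(audio) -> str: ...                        # off-the-shelf ASR (placeholder)
def translate(text: str, target_lang: str) -> str: ...   # LLM finetuned for translation (placeholder)
def synthesise_speech(text: str, voice_ref) -> Any: ...  # TTS cloned to the original actor (placeholder)
def reconstruct_3d(frames) -> FaceTrack: ...             # monocular 3D face reconstruction (placeholder)
def predict_expressions(audio) -> List[Any]: ...         # audio-to-expression network (placeholder)
def neural_render(frames, face: FaceTrack): ...          # masked image-to-image renderer (placeholder)

def dub(frames: List[Any], audio, voice_ref, target_lang: str = "de"):
    """Dub an English-language clip into `target_lang`, following Figure 1."""
    transcript = transcribe(audio)
    translated = translate(transcript, target_lang)
    new_audio = synthesise_speech(translated, voice_ref)

    face = reconstruct_3d(frames)                      # keep original pose, geometry, blinks
    face.expression = predict_expressions(new_audio)   # replace lip/jaw motion from the new audio

    return neural_render(frames, face), new_audio      # photorealistic frames + dubbed audio track
```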
The scope of this article is limited to discussions surrounding state-of-the-art works tackling facial animation generation; namely, we explore the recent trends in talking head generation, and audio driven automatic dubbing/video editing. The rest of the paper is organised as follows: Section 3 provides a detailed discussion on methods that seek to tackle talking head generation and automatic dubbing, classifying them as either end-to-end or structural-based methods, and discussing their merits and pitfalls. Section 4 provides details on popular datasets used to train models for these tasks, as well as a list of common evaluation metrics used to quantify the performance of such models. Section 5 provides discussion on open challenges within the field, and how researchers have been tackling them, before concluding the paper in Section 6.

3 Taxonomy of talking head generation and automatic dubbing

Talking head generation can be defined as the creation of a new video from a single source image or handful of frames, and a driving speech audio input. There are many challenges associated with this (Chen et al., 2020a). Not only must the generated lip and jaw movements be correctly synchronised to the speech input, but the overall head movement must also be realistic, eye blinking consistent with the speaker should be present, and the expressions on the face should match the tone and content of the speech. While many talking head approaches have been proposed in recent years, each addressing some or all of the aforementioned issues to various degrees, there is plenty of scope for researchers to further the field, as this article will demonstrate.

As touched upon earlier, the task of audio driven automatic dubbing is a constrained version of the talking head generation problem. Instead of creating an entire video from scratch, the goal is to alter an existing video, resynchronizing the lip and jaw movements of the target actor in response to a new input audio signal. Unlike talking head generation, factors such as head motion, eye blinks, and facial expressions are already present in the original video. The challenge lies in seamlessly altering the lip and jaw content of the video, while keeping the performance of the actor as close to the original as possible, so as to not detract from it.

3.1 End-to-end vs. structural-based generation

At a high level, existing deep learning approaches to both tasks can be broken down into two main methods: end-to-end or structural-based generation. Each method has its own set of advantages and disadvantages, which we will now go over.
3.1.1 Pipeline complexity and model latency
End-to-end approaches offer the advantage of a simpler pipeline, enabling faster processing and reduced latency in generating the final output. With fewer components and streamlined computations, real-time synthesis becomes achievable. However, the actual performance relies on crucial factors like the chosen architecture, model size, and output frame size. For example, GAN-based end-to-end methods can achieve real-time results, but they are often limited to lower output resolutions, such as 128 × 128 or 256 × 256. Diffusion-based approaches are even slower, often taking seconds or even minutes per frame, even with more efficient sampling methods, which trade image quality for speed. Striking the right balance between speed and output resolution is essential in optimizing end-to-end talking head synthesis. It is important to highlight that these same limitations are also present for structural-based methods, particularly within their rendering process. However, structural-based methods tend to be even slower than end-to-end approaches due to the additional computational steps involved in their pipeline. Structural-based methods often require multiple stages, such as face detection, facial landmark/3D model extraction, expression synthesis, photorealistic rendering and so on. Each of these stages introduces computational overhead, making the overall process more time-consuming.

3.1.2 Cascading errors
In structural-based methods, errors made in earlier stages of the pipeline can propagate and amplify throughout the process. For example, inaccuracies in face or landmark detection can significantly impact the quality of the final generated video. End-to-end approaches, on the other hand, bypass the need for such intermediate representations, reducing the risk of cascading errors. At the same time, however, when errors do occur in end-to-end approaches, it can be harder to identify the source of the error, as such methods do not explicitly produce intermediate facial representations. This lack of transparency in the generation process can make it challenging for researchers to diagnose and troubleshoot issues when the output is not as expected. It becomes essential to develop techniques for error analysis and debugging to improve the reliability and robustness of end-to-end systems.

3.1.3 Robustness to different data
Structural-based methods rely on carefully curated and annotated datasets for each stage of the pipeline, which can be time-consuming and labor-intensive to create. End-to-end approaches are often more adaptable and generalize better to various speaking styles, accents, and emotional expressions, as they can leverage large and diverse datasets for training. This flexibility is crucial in capturing the nuances and variations present in natural human speech and facial expressions.

3.1.4 Output quality
The quality of output is a critical aspect in talking head synthesis, as it directly impacts the realism and plausibility of the generated videos. Structural-based methods excel in this regard due to their ability to exert more fine-grained control over the intermediate representations of the face during the synthesis process. With such methods, the face is typically represented using a set of keypoints (or 3D model parameters), capturing essential facial features and expressions. These landmarks serve as a structured guide for the generation of facial movements, ensuring that the resulting video adheres to the anatomical constraints of a human face. By explicitly controlling these keypoints, the model can produce more accurate and realistic facial expressions that are consistent with human facial anatomy.

End-to-end approaches sacrifice some level of fine-grained control in favor of simplicity and direct audio-to-video mapping. While they offer the advantage of faster processing and reduced latency, they may struggle to capture the intricate details and nuances present in facial expressions, especially in more challenging or uncommon scenarios.

3.1.5 Training data requirements
End-to-end approaches typically require a large amount of training data to generalize well across various situations. While structural-based methods can benefit from targeted, carefully annotated datasets for specific tasks, end-to-end methods may need a more diverse and extensive dataset to achieve comparable performance. This, in turn, means longer training times as the model needs to process and learn from a vast amount of data, which can be computationally intensive and time-consuming. This can be a significant drawback for researchers and practitioners, as it hinders the rapid experimentation and development of new models. It may also require access to powerful hardware, such as high-performance GPUs or TPUs, to accelerate the training process.

3.1.6 Explicit output guidance
Structural-based methods allow researchers to incorporate explicit rules and constraints into different stages of the pipeline. This explicit guidance can lead to more accurate and controllable results, which can be lacking in end-to-end approaches where such guidance is more difficult to implement.

3.2 Structural based generation

Structural based deep learning approaches have been immensely popular in recent years, and are considered the dominant approach when it comes to both talking head generation and audio driven automatic dubbing. As mentioned above, this is due to the relative ease with which one can exert control over the final output video, high quality image frame fidelity, and the relative speed with which animations can be driven for 3D character models.

Instead of training a single neural network to generate the desired video given an audio signal, the problem is typically broken up into two main steps: 1) training a neural network to drive, from audio, the facial motion of an underlying structural representation of the face, where the structural representation is typically either a 3D morphable model or a 2D/3D keypoint representation of the face; and 2) rendering photorealistic video frames from the structural model of the face using a second neural rendering model. Please see Table 1 for a summary of relevant structural-based approaches in the literature.
TABLE 1 Table summarising some of the most relevant structural-based approaches in the literature.

| Method | Animation network architecture | Audio input | Intermediate representation | Additional inputs | Head motion | Rendering network architecture |
|---|---|---|---|---|---|---|
| Suwajanakorn et al. (2017) | LSTM | MFCC | PCA mouth coefficients | None | No | AAM-based rendering |
| Taylor et al. (2017) | Feed forward | Phoneme transcript | Face model animation parameters | None | No | Video compositing approach |
| Das et al. (2020) | GAN | DeepSpeech features | 2D landmarks | None | No | GAN |
| Zhou et al. (2020) | LSTM | Learned speech embeddings | 2D landmarks | None | Yes | GAN |
| Wang et al. (2021) | LSTM | MFCC + FBANK features | Keypoints—dense motion field | None | Yes | CNN |
| Ji et al. (2021) | LSTM | Learned speech embeddings | 2D landmarks + 3D face model | Driving video | From video | GAN |
| Bigioi et al. (2022) | Recurrent LSTM | Mel spectrogram | 2D landmarks | None | Yes | Not applicable |
| Karras et al. (2017) | CNN | Autocorrelation features | 3D vertex positions of face mesh | Emotional state | No | Not applicable |
| Cudeiro et al. (2019) | CNN encoder-decoder | DeepSpeech features | FLAME face model | None | No | Not applicable |
| Thies et al. (2020) | CNN | DeepSpeech features | 3D expression parameters | None | No | CNN |
| Chen et al. (2020b) | CNN | Raw audio | 3D keypoints | Reference frames | Yes | GAN |
| Yi et al. (2020) | LSTM | MFCC | 3D expression parameters | Driving video | Yes | GAN |
| Wu et al. (2021) | Encoder-decoder + UNet | DeepSpeech features | 3D expression parameters | Driving video | Yes | GAN |
| Zhang et al. (2021b) | GAN | Learned speech embeddings | 3D expression parameters | Reference image | Yes | GAN |
| Zhang et al. (2021a) | GAN | DeepSpeech features | 3D expression parameters | Driving video | Yes | GAN |
| Song et al. (2022) | LSTM + UNet | MFCC | 3D expression parameters | Driving video | No | UNet |
| Wen et al. (2020) | GAN | MFCC | 3D expression parameters | Driving video | No | GAN |
| Lahiri et al. (2021) | CNN | Spectrograms | 3D vertex positions | Driving video | No | CNN |
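As a concrete illustration of step 1 of this two-step recipe, the snippet below sketches a small PyTorch network mapping a sequence of per-frame audio features to per-frame 2D landmark displacements. The architecture, feature dimensions and landmark count are illustrative assumptions and do not reproduce any specific method from Table 1.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Illustrative audio-to-landmark regressor (step 1 of a structural pipeline).

    Maps a sequence of per-frame audio features (e.g., MFCCs) to per-frame
    displacements for 68 facial landmarks; a separate neural renderer (step 2)
    would turn the resulting landmarks into photorealistic frames.
    """

    def __init__(self, audio_dim: int = 26, hidden: int = 256, n_landmarks: int = 68):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)   # (x, y) offset per landmark

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim)
        h, _ = self.rnn(audio_feats)
        disp = self.head(h)                              # (batch, time, 136)
        return disp.view(*disp.shape[:2], 68, 2)         # per-frame landmark offsets

# Training sketch: regress towards ground-truth landmark displacements.
model = AudioToLandmarks()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
audio = torch.randn(4, 100, 26)        # dummy batch: 4 clips of 100 frames
target = torch.randn(4, 100, 68, 2)    # dummy ground-truth displacements
opt.zero_grad()
loss = nn.functional.mse_loss(model(audio), target)
loss.backward()
opt.step()
```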
3.2.1 2D/3D landmark based methods
In this section we discuss methods that rely on either 2D or 3D face landmarks as an intermediate structural representation for producing facial animations from audio. Some of the discussed methods use the generated landmarks to animate a 3D face model; these methods shall also be considered "landmark-based." Figure 2 depicts a high level overview of what a typical landmark-based approach could look like.

FIGURE 2
A high-level landmark based pipeline, where headpose and face structure from an existing video is combined with predicted lip/jaw displacements from a target audio clip to generate a modified video.

Suwajanakorn et al. (2017) and Taylor et al. (2017) were among the first works to explore using deep learning techniques to generate speech animation. The former trained a recurrent network to generate sparse mouth key points from audio before compositing them onto an existing video, and the latter presented an approach for generalised speech animation by training a neural network model to predict animation parameters of a reference face model given phoneme labels as input. The field has come a long way since then, with Eskimez et al. (2018) presenting a method for generating static (no headpose) talking face landmarks from audio via an LSTM-based model, and Chen et al. (2019) expanding the work by conditioning a GAN network on the landmarks to generate photorealistic frames. Similarly, Das et al. (2020) also employed a GAN based architecture to generate facial landmarks from DeepSpeech features extracted from audio, before using a second GAN conditioned on the landmarks to generate the photorealistic frames.

Zhou et al. (2020)'s approach was among the first to generate talking face landmarks with realistic head pose movement from audio. They did this by training two LSTM networks, one to handle the lip/jaw movements, and a second to generate the headpose, before combining the two outputs and passing them through an off-the-shelf image-to-image translation network for generating photorealistic frames.

Lu et al. (2021)'s approach also simulated headpose and upper body motion using a separate auto regressive model trained on DeepSpeech audio features before generating realistic frames using an image-to-image translation model conditioned on feature maps based on the generated landmarks. While also proposing an approach for the head pose problem, Wang et al. (2021) tackled the challenge of stabilising non-face (background) regions when generating talking head videos from a single image.
Unlike the previous methods, which were all approaches at solving the talking head generation task, the following papers fall into the audio-driven automatic dubbing category and seek to modify existing videos. Ji et al. (2021) were among the first to tackle the problem of generating emotionally aware video portraits by disentangling speech into two representations, a content-aware time dependent stream and an emotion-aware time independent stream, and training a model to generate 2D facial landmarks. It may be considered a "hybrid" structural approach, as from both the predicted and ground truth landmarks they perform monocular 3D reconstruction to obtain two 3D face models. They then combine the pose parameters from the ground truth with the expression and geometry parameters of the predicted landmarks to create the final 3D face model, before extracting edge maps and generating the output frames via image-to-image translation. Bigioi et al. (2022) extracted ground truth 3D landmarks from video, and trained a network to alter them directly given an input audio sequence, without the need to first retarget them to a static fixed face model before animating it and then returning the original headpose.
3.2.2 3D model based methods
In this section we discuss methods that use 3D face models as intermediate representations when generating facial animations. In other words, we talk about methods that train models to produce blendshape face parameters from audio signals as input. Figure 3 depicts a high-level overview of one such model.

FIGURE 3
A high-level 3D model based pipeline, where monocular facial reconstruction is performed on a source video to extract expression, pose, and geometry parameters. A separate audio to expression parameter prediction network is then trained. The predicted expression parameters are then used to replace the original ones, to generate a new 3D facial mesh, which is then rendered into a photorealistic video via a neural rendering model.
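The core operation these 3D-model-based dubbing methods share is a parameter swap: pose, identity, and upper-face motion are kept from the source video while the audio-driven network overwrites the mouth/jaw expression coefficients. The NumPy sketch below illustrates this; the array shapes and the index range assigned to mouth/jaw coefficients are assumptions made for illustration, since the split differs between 3DMMs.

```python
import numpy as np

N_FRAMES, N_EXP = 250, 64                  # illustrative clip length / blendshape count
MOUTH_JAW_IDX = np.arange(0, 32)           # assumed subset of expression coefficients

def dub_expression_track(video_params: dict, audio_driven_exp: np.ndarray) -> dict:
    """Keep the source pose/shape/upper-face motion, swap in audio-driven lip/jaw motion."""
    dubbed = {k: v.copy() for k, v in video_params.items()}
    # Only the mouth/jaw-related coefficients are replaced; head pose, identity,
    # blinks and the remaining expression components are preserved from the video.
    dubbed["expression"][:, MOUTH_JAW_IDX] = audio_driven_exp[:, MOUTH_JAW_IDX]
    return dubbed

video_params = {                           # per-frame 3DMM parameters from monocular reconstruction
    "shape": np.zeros((N_FRAMES, 80)),
    "pose": np.zeros((N_FRAMES, 6)),
    "expression": np.random.randn(N_FRAMES, N_EXP),
}
new_exp = np.random.randn(N_FRAMES, N_EXP)   # output of an audio-to-expression network
dubbed_params = dub_expression_track(video_params, new_exp)
# `dubbed_params` would then be meshed and passed to a neural renderer, as in Figure 3.
```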
Karras et al. (2017) were among the first to use deep learning to learn facial animation for a 3D face model from limited audio data. Cudeiro et al. (2019) introduced a 4D audiovisual face dataset (talking 3D models), as well as a network trained to generate 3D facial animations from DeepSpeech audio features. Thies et al. (2020) also utilised DeepSpeech audio features to train a network to output speaker independent facial expression parameters that drive an intermediate 3D face model before generating the photorealistic frames using a neural rendering model. Chen et al. (2020b)'s approach involved learning head motion from a collection of reference frames, and then combining that information with learned PCA components denoting facial expression in a 3D aware frame generation network. Their approach is interesting because their pipeline addresses various known problems within talking head generation such as keeping the identity/appearance of the head consistent, maintaining a consistent background, and generating realistic speaker aware head motion. Yi et al. (2020) presented an approach to generate talking head videos using a driving audio signal by training a neural network to predict pose and expression parameters for a 3D face model from audio, and combining them with shape, texture, and lighting parameters extracted from a set of reference frames. They then render the 3D face model to photo realism via a neural renderer, before fine tuning the rendered frames with a memory augmented GAN. Wu et al. (2021) presented an approach to generate talking head faces of a target portrait given a driving speech signal and a "Style Reference Video." They train their model such that the output video mimics the speaking style of the reference video but whose identity corresponds to the target portrait. Zhang et al. (2021b) presented a method for one shot talking head animation. Given a reference frame and driving audio source they generate eyebrow, head pose, and mouth motion parameters of a 3D morphable model using an encoder-decoder architecture. A flow-guided video generator is then used to create the final output frames. Zhang et al. (2021a) synthesize talking head videos given a driving speech input and reference video clip. They design a GAN based module that can output expression, eyeblink, and headpose parameters of a 3DMM given DeepSpeech audio features.

While the previously referenced methods are all examples of pure talking head generation approaches, the following are in the automatic dubbing category. Both Song et al. (2022) and Wen et al. (2020) presented approaches to modify an existing video using a driving audio signal by training a neural network to extract 3D face model expression parameters from audio, and combining them with pose and geometry parameters extracted from the original video before applying neural rendering to generate the modified photorealistic video. To generate the facial animations, Song et al. (2021) employ a similar pipeline to the methods referenced above; however, they go one step further, and transfer the acoustic properties of the original video's speaker onto the driving speech via an encoder-decoder mechanism, essentially dubbing the video. Richard et al. (2021) provided a generalised framework for generating accurate 3D facial animations given speech, by learning a categorical latent space that disentangles audio-correlated (lips/jaw motion) and audio un-correlated (eyeblinks, upper facial expression) information at inference time. Doing so, they built a framework that can be applied to both automatic dubbing and talking head generation tasks. Lahiri et al. (2021) introduced an encoder-decoder architecture trained to decode 3D vertex positions [similar to Karras et al. (2017)], and 2D texture maps of the lip region from audio and the previously generated frame. They combine these to form a textured 3D face mesh which they then render and blend with the original video to generate the dubbed video clip.
We would also like to draw attention to the works of Fried et al. (2019) and Yao et al. (2021). These are video editing approaches which utilise text, in addition to audio, to modify existing talking head videos. The former approach works by aligning phoneme labels to the input audio, and constructing a 3D face model for each input frame. Then, when modifying the text transcript (e.g., dog to god), they search for segments of the input video where the visemes are similar, blending the 3D model parameters from the corresponding video frames to generate a new frame which is then rendered via their neural renderer. The latter approach builds off this work by improving the efficiency of the phoneme matching algorithm, and developing a self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor.

3.3 End-to-end generation

Though less popular in recent times than their structural based counterparts, the potential to generate or modify a video directly given an input audio signal is one of the key factors that make end-to-end approaches an attractive proposition to talking head researchers. These methods aim to learn the complex mapping between audio, facial expressions and lip movements using a single unified model that combines the traditional stages of talking head generation into a single step. By doing so, they eliminate the need for explicit intermediate representations, such as facial landmarks, or 3D models, which can be computationally expensive and prone to error. This ability to directly connect the audio input to the video output streamlines the synthesis process and can enable real-time or near-real-time generation. Please see Table 2 for a summary of relevant end-to-end based approaches in the literature.
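For contrast with the structural route, the snippet below sketches the single-network mapping that end-to-end methods learn: a reference identity image and a short window of audio features go in, and a synthesised frame comes out. The encoder-decoder shown is an illustrative toy, not the architecture of any method listed in Table 2.

```python
import torch
import torch.nn as nn

class EndToEndTalkingHead(nn.Module):
    """Illustrative end-to-end generator: (identity image, audio window) -> frame."""

    def __init__(self, audio_dim: int = 80, audio_ctx: int = 16):
        super().__init__()
        # Encode the reference identity frame (3 x 128 x 128) into a feature map.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Encode a short window of audio features into a vector.
        self.aud_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(audio_dim * audio_ctx, 256), nn.ReLU(),
        )
        self.aud_to_map = nn.Linear(256, 8 * 32 * 32)
        # Decode the fused image/audio features back to a full frame.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128 + 8, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, identity: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        f_img = self.img_enc(identity)                      # (B, 128, 32, 32)
        f_aud = self.aud_to_map(self.aud_enc(audio))        # (B, 8*32*32)
        f_aud = f_aud.view(-1, 8, 32, 32)                   # audio broadcast as a feature map
        return self.dec(torch.cat([f_img, f_aud], dim=1))   # (B, 3, 128, 128)

frame = EndToEndTalkingHead()(torch.randn(2, 3, 128, 128),   # two identity frames
                              torch.randn(2, 16, 80))        # 16-step audio windows
```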
Chung et al. (2017) proposed one of the first end-to-end talking head generation techniques. Given a reference identity frame and driving speech audio signal, they succeeded in training an encoder-decoder based architecture to generate talking head videos, additionally demonstrating how their approach could be applied to the dubbing problem. Their approach was limited however, as it only generated the cropped region around the face, discarding any background.

Chen et al. (2018) presented a GAN based method of generating lip movement from a driving speech source and reference lip frame. Similar to the above method, theirs was limited to generating just the cropped region of the face surrounding the lips. Song et al. (2018) presented a more generalised GAN-based approach for talking head generation that also took the temporal consistency between frames into account by introducing a recurrent unit in their pipeline, generating smoother videos. Zhou et al. (2019) proposed a model that could generate videos based on learned disentangled representations of speech and video. The approach is interesting because it allowed the authors to generate a talking head video from a reference identity frame and a driving speech signal or video. Mittal and Wang (2020) disentangled the audio signal into various factors such as phonetic content and emotional tone, and conditioned a talking head generative model on these representations instead of the raw audio, demonstrating compelling results. Vougioukas et al. (2020) proposed an approach to generate temporally consistent talking head videos from a reference frame and audio using a GAN-based approach. Their method generated realistic eyeblinks in addition to synchronised lip movements in an end-to-end manner. Prajwal et al. (2020) introduced a "lip-sync discriminator" for generating more accurate lip movements on talking head videos, as well as proposing new metrics to evaluate lip synchronization on generated videos. Eskimez et al. (2020) proposed a robust GAN based model that could generate talking head videos from noisy speech. Kumar et al. (2020) proposed a GAN-based approach for one shot talking head generation. Zhou et al. (2021) proposed an interesting approach to exert control over the pose of an audio-driven talking head. Using a target "pose" video and speech signal, they condition a model to generate talking head videos from a single reference identity image whose pose is dictated by the target video.

TABLE 2 Table summarising some of the most relevant end-to-end approaches in the literature.

| Method | Network architecture | Audio input | Additional inputs | Head motion | Full face generation |
|---|---|---|---|---|---|
| Chen et al. (2018) | GAN | Mel spectrogram | Reference lip image | No | Limited to lip region only |
| Mittal and Wang (2020) | LSTM + GAN | Learned speech embeddings | Reference image | No | Yes |
| Prajwal et al. (2020) | GAN | Mel spectrogram | Driving video | Yes | Yes |
| Eskimez et al. (2020) | LSTM + GAN | Raw audio | Reference image | No | Yes |
| Zhou et al. (2021) | GAN | Spectrograms | Driving video + reference frame | Yes | Yes |
| Stypułkowski et al. (2023) | Diffusion UNet | Learned speech embeddings | Reference image | Yes | Yes |
| Shen et al. (2023) | Diffusion UNet | Learned speech embeddings | Reference image + face landmarks | Yes | Yes |
| Bigioi et al. (2023) | Diffusion UNet | Mel spectrograms | Reference image | Yes | Yes |
While GAN-based methods (Goodfellow et al., 2014) such as the approaches referenced above have been immensely popular in recent years, they have been shown to have a number of limitations by practitioners in the field. Due to the presence of multiple losses and discriminators, their optimization process is complex and quite unstable. This can lead to difficulties in finding a balance between the generator and discriminator, resulting in issues like mode collapse, where the generator fails to capture the full diversity of the target distribution. Vanishing gradients is another issue, which occurs when gradients become too small during back propagation, preventing the model from learning effectively, especially in deeper layers. This can significantly slow down the training process and limit the overall performance of the model. With that in mind, we would like to draw special attention to diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Dhariwal and Nichol, 2021; Nichol and Dhariwal, 2021), a new class of generative model that has gained prominence in the last couple of years due to strong performance on a myriad of tasks such as text based image generation, speech synthesis, colourisation, body animation prediction, and more.

3.4 Diffusion-based generation

We dedicate a short section of this paper towards diffusion based approaches, due to their recent rise in use and popularity. Note that within this section, we describe methods found from both the end-to-end and structural-based schools of thought as, at this time, there are only a handful of diffusion-based talking head works.

For a deeper understanding of the diffusion architecture, we direct readers to the works of Sohl-Dickstein et al. (2015); Ho et al. (2020); Dhariwal and Nichol (2021); Nichol and Dhariwal (2021), as these are the pioneering works that contributed to their recent popularity and wide-spread adoption. In short, however, the diffusion process can be summarised as consisting of two stages: 1) the forward diffusion process, and 2) the reverse diffusion process.

In the forward diffusion process, the desired output data is gradually "destroyed" over a series of time steps by adding Gaussian noise at each step until the data becomes just another sample from a standard Gaussian distribution. Conversely, in the reverse diffusion process, a model is trained to gradually denoise the data by removing the noise at each time step, with the loss typically being computed as a distance function between the predicted noise and the actual noise that was added at that particular time step. The combination of these two stages enables diffusion models to model complex data distributions without suffering from mode collapse, unlike GANs, and to generate high-quality samples without the need for adversarial training or complex loss functions.
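In the standard denoising diffusion (DDPM) notation of Ho et al. (2020), with alpha_t = 1 - beta_t and the cumulative product of the alphas denoted by a bar, the two stages described above are usually written as:

```latex
% Forward (noising) process, with \alpha_t = 1-\beta_t and \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s:
q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)

% Learned reverse (denoising) process and the usual noise-prediction training loss:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big),
\qquad
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\,\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\,\big]
```

The final term is exactly the distance between predicted and actual noise referred to above; in conditional talking head models the noise predictor is additionally conditioned on signals such as the driving audio and an identity frame.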
Within the context of talking head generation and video editing, there are a number of recent works that have explored using diffusion models. Specifically, Stypułkowski et al. (2023), Shen et al. (2023), and Bigioi et al. (2023) were among the first to explore their use for end-to-end talking head generation and audio driven video editing. All three methods follow a similar auto-regressive frame-based approach where the previously generated frame is fed back into the model along with the audio signal and a reference identity frame to generate the next frame in the sequence. Notably, Shen et al. (2023) condition their model with landmarks, and perform their training within the latent space to save on computational resources, unlike Stypułkowski et al. (2023) and Bigioi et al. (2023). Stypułkowski et al. (2023)'s approach can be considered a true talking head generation method, as it does not rely on any frames from the original video to guide the model (except for the initial seed/identity frame), and the resultant video is completely synthetic. Bigioi et al. (2023) perform video editing by modifying an existing video sequence, teaching their model to inpaint a masked-out facial region of the video in response to an input speech signal. Shen et al. (2023)'s approach is similar, in that they perform video editing rather than talking head generation by modifying an existing video with the use of a face mask designed to cover the facial region of the source video.
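At a high level, the shared autoregressive scheme of these three methods can be sketched as the loop below. Here denoise_frame is a placeholder standing in for a full reverse-diffusion sampler conditioned on the audio, identity frame, and previous frame; it is not an actual API of any of the cited works.

```python
from typing import Any, Callable, List

def denoise_frame(prev_frame, identity_frame, audio_window, sample_noise: Callable):
    """Placeholder for a reverse-diffusion sampler producing one conditioned frame."""
    x_t = sample_noise()                 # start from Gaussian noise
    # A real implementation would iterate over the diffusion timesteps here,
    # denoising x_t conditioned on (prev_frame, identity_frame, audio_window).
    return x_t

def generate_video(identity_frame, audio_windows: List[Any], sample_noise: Callable) -> List[Any]:
    frames, prev = [], identity_frame            # the identity frame seeds the sequence
    for audio_window in audio_windows:           # one audio chunk per output frame
        frame = denoise_frame(prev, identity_frame, audio_window, sample_noise)
        frames.append(frame)
        prev = frame                             # feed the generated frame back in
    return frames
```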
While the above approaches are currently the only end-to-end diffusion based methods, a number of structural based approaches that leverage diffusion models have also been proposed in recent months. Zhang et al. (2022) proposed an approach that used audio to predict landmarks, before using a diffusion based renderer to output the final frame. Zhua et al. (2023) also utilised a diffusion model similarly, using it to take the source image and the predicted motion features as input to generate the high-resolution frames. Du et al. (2023) introduced an interesting two stage approach for talking head generation. The first stage consisted of training a diffusion autoencoder on video frames, to extract latent representations of the frames. The second stage involved training a speech to latent representation model, with the idea being that the latents predicted from the speech could be decoded by the pretrained diffusion autoencoder into image frames. The method achieves impressive results, outperforming other relevant structural-based methods in the field. Xu et al. (2023) use a diffusion-based renderer conditioned on multi-modal inputs to drive the emotion and pose of the generated talking head videos. Notably, their approach is also applicable to the face swapping problem.

Within the realm of talking heads, diffusion models have shown incredibly promising results, often producing videos with demonstrably higher visual quality, and similar lip sync performance, compared to more traditional GAN-based methods. One major limitation, however, lies in their inability to model long sequences of frames without the output degrading in quality over time due to their autoregressive nature. It will be exciting to see what the future holds for further research in this area.
3.5 Other approaches

There are certain approaches that do not necessarily fit into the aforementioned subcategories, but are still relevant and worth discussing. Viseme based methods such as Zhou et al. (2018) are early approaches to driving 3D character models. The authors presented an LSTM based network capable of producing viseme curves that could drive JALI based character models as described by Edwards et al. (2016).

Guo et al. (2021) is a unique method for talking head generation that, instead of relying on traditional intermediate structural representations such as landmarks or 3DMMs, generates a neural radiance field from audio, from which a realistic video is synthesised using volume rendering.

4 Popular datasets and evaluation metrics

In this section we describe the most popular metrics for measuring the quality of videos generated by audio-driven talking head and automatic dubbing models.

4.1 Evaluation metrics

Quantitatively evaluating both talking head and dubbed videos is not a straightforward task. Traditional perceptual metrics such as SSIM, or distance-based metrics such as the L2 norm or PSNR, which seek to quantify the similarity between two images, are inadequate. Such metrics do not take into account the temporal nature of video, with the quality of a video being affected not only by the individual quality of frames, but also by the smoothness and synchronisation of the frames as they are played back in the video. Although these metrics may not provide a perfect evaluation of video quality, they are still important for benchmarking purposes as they provide a good indication of what to expect from the model. As such, when there is access to ground truth samples to compare a model's output with, the following metrics are commonly used:

PSNR (Peak Signal to Noise Ratio): The peak signal to noise ratio between the ground truth and the generated image is computed. The higher the PSNR value, the better the quality of the reconstructed image.

Facial Action Units (AU) (Ekman and Friesen, 1978) Recognition: Song et al. (2018) and Chen et al. (2020b) popularised a method for evaluating reconstructed images with respect to ground truth samples using five facial action units.

ACD (Average Content Distance) (Tulyakov et al., 2018): As used by Vougioukas et al. (2020), the Cosine (ACD-C) and Euclidean (ACD-E) distance between the generated frame and ground truth image can be calculated. The smaller the distance between two images, the more similar the images.

SSIM (Structural Similarity Index) (Wang et al., 2004): This is a metric designed to measure the similarity between two images by looking at the luminance, contrast, and structure of the pixels in the images.

Landmark Distance Metric (LMD): Proposed by Chen et al. (2018), Landmark Distance (LMD) is a popular metric used to evaluate the lip synchronisation of a synthetic video. It works by extracting facial landmark lip coordinates for each frame of both the generated and ground truth videos using an off-the-shelf facial landmark extractor, calculating the Euclidean distance between them, and normalising based on the length of the video and number of frames.
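For reference, the reconstruction and lip-sync metrics above reduce to a few lines of NumPy. The implementations below are simplified sketches, assuming 8-bit images for PSNR and pre-extracted lip landmark arrays for LMD.

```python
import numpy as np

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal to Noise Ratio between a ground-truth and a generated image (higher is better)."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def lmd(gt_lips: np.ndarray, pred_lips: np.ndarray) -> float:
    """Simplified Landmark Distance (Chen et al., 2018); lower is better.

    Both inputs have shape (n_frames, n_lip_points, 2): lip landmark coordinates
    extracted from the ground-truth and generated videos with an off-the-shelf
    facial landmark detector.
    """
    per_point = np.linalg.norm(gt_lips - pred_lips, axis=-1)  # Euclidean distance per landmark
    return float(per_point.mean())                            # normalised over frames and points
```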
Unfortunately, when generating talking head or dubbed videos, oftentimes it is impossible to use the metrics discussed above as there is no corresponding ground truth data with which to compare the generated samples. Therefore, a number of perceptual metrics (metrics which seek to emulate how humans perceive things) have been proposed to address this problem. These include:

CPBD (Cumulative Probability Blur Detection) (Narvekar and Karam, 2011): This is a perceptual based metric used to detect blur in images and measure image sharpness. Used by Kumar et al. (2020); Vougioukas et al. (2020); Chung et al. (2017) to evaluate their talking head videos.

WER (Word Error Rate): A pretrained lip reading model is used to predict the words spoken by the generated face. Works such as Kumar et al. (2020) and Vougioukas et al. (2020) use the LipNet (Assael et al., 2016) model, which is pre-trained on the GRID data set and achieves 95.2 percent lip reading accuracy.

SyncNet Based Metrics: These are perceptual metrics based on the SyncNet model introduced by Chung and Zisserman (2017b) that evaluate lip synchronisation in unconstrained videos. Prajwal et al. (2020) introduced two such metrics: 1) LSE-D, which is the average error measure calculated in terms of the distance between the lip and audio representations, and 2) LSE-C, which is the average confidence score. These metrics have proven popular since their introduction, with a vast majority of recent papers in the field using them for evaluating their videos.

4.2 Benchmark Datasets

There are a number of benchmark datasets used to evaluate talking head and video dubbing models. They can be broadly categorised as being either "in-the-wild" or "lab conditions" style datasets. In this section we list some of the most popular ones, and briefly describe them.

• VoxCeleb 1 and 2 (Nagrani et al., 2017; Chung et al., 2018): This dataset contains audio and video recordings of celebrities speaking in the wild. It is often used for training and evaluating talking head generation, lip reading, and dubbing models. The former contains over 150,000 utterances from 1,251 celebrities, and the latter over 1,000,000 utterances from 6,112 celebrities.
• GRID (Cooke et al., 2006): The GRID dataset consists of audio and video recordings of 34 speakers reading 1,000 sentences in lab conditions. It is commonly used for evaluating lip-reading algorithms but has also been used for talking head generation and video dubbing models.
• LRS3-TED (Afouras et al., 2018): This dataset contains audio and video recordings of over 400 h of TED talks, which are speeches given by experts in various fields.
• LRW (Chung and Zisserman, 2017a): The LRW (Lip Reading in the Wild) dataset consists of up to 1,000 utterances of 500 different words, spoken by hundreds of different speakers in the wild.
• CREMA-D (Cao et al., 2014): This dataset contains audio and video recordings of people speaking in various emotional states (happy, sad, anger, fear, disgust, and neutral). In total it contains 7,442 clips of 91 different actors recorded in lab conditions.
• TCD-TIMIT (Harte and Gillen, 2015): The Trinity College Dublin Talking Heads dataset (TCD-TIMIT) contains video recordings of 62 actors speaking in a controlled environment.
• MEAD Dataset (Wang et al., 2020): This dataset contains videos featuring 60 actors talking with eight different emotions at three different intensity levels (except for neutral). The videos are simultaneously recorded from seven different perspectives, with roughly 40 h of speech recorded for each person.
• RAVDESS Dataset (Wang et al., 2020): The Ryerson Audio-Visual Database of Emotional Speech and Song is a corpus consisting of 24 actors speaking with calm, happy, sad, angry, fearful, surprise, and disgust expressions, and singing with calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. It contains 7,356 recordings in total.
• CelebV-HQ (Zhu et al., 2022): CelebV-HQ is a dataset containing 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering aspects such as appearance, action, and emotion.

5 Open challenges

Although significant progress has been made in the fields of talking head generation and automatic dubbing, these areas of research are constantly evolving, and several open challenges still need to be addressed, offering plenty of opportunities for future work.

5.1 Bridging the uncanny valley

Despite existing research, generating truly realistic talking heads remains an unsolved problem. There are various factors that come into play when discussing the topic of realism and how we can bridge the "Uncanny Valley" effect in video dubbing. These include:

• Visual quality: Realistic talking head videos should have high-quality visuals that accurately capture the colors, lighting, and textures of the scene. This requires attention to detail in the rendering process. Currently, most talking head and visual dubbing approaches are limited to generating videos at low output resolutions, and those that do work on higher resolutions are quite limited both in terms of model robustness and generalisation (more on that later). This is due to several reasons: 1) the computational complexity of deep learning models rises significantly when generating high-resolution videos, both in terms of training time and inference speed; this, in turn, has an adverse effect on real-time performance; 2) generating realistic talking head videos requires the model to capture intricate details of facial expressions, lip movements, and speech patterns; as the output resolution of the video increases, so too does the demand for more fine-grained details, making it more difficult for models to achieve high degrees of realism; 3) storage and bandwidth limitations: high-resolution videos require both of these in abundance, limiting high resolution generation to researchers who have access to state-of-the-art hardware systems. Some approaches that have sought to tackle this issue are the works of Gao et al. (2023), Guo et al. (2021), and Shen et al. (2023), whose approaches are capable of outputting high resolution frames.
• Motion: Realistic talking head/dubbed videos should have realistic motion, including smooth and natural movements of the face in response to speech, and realistic head motion when generating videos from scratch. This is a continuous topic of interest, with many works exploring it such as Chen et al. (2020b), Wang et al. (2021), and more recently Zhang et al. (2023).
• Disembodied Voice: The phenomenon of a Disembodied Voice is characterized by a jarring mismatch between a speaker's voice and their physical appearance, which is a commonly encountered issue in movie dubbing. Despite its significance, this issue remains relatively unexplored within the realm of talking head literature, thereby presenting a promising avenue for researchers to investigate further. The work conducted by Oh et al. (2019) demonstrated that there is an inherent link between a speaker's voice and their appearance that can be learned, thus lending credence to the idea that dubbing efforts should prioritize the synchronization of voice and appearance.
• Emotion: Realistic videos should evoke realistic emotions, including facial expressions, body language, and dialogue. Achieving realistic emotions requires careful attention to acting and performance, as well as attention to detail in the animation and sound design. Recent works seeking to incorporate emotion into their generated talking heads include Ma et al. (2023), Liang et al. (2022), Li et al. (2021).
5.2 The data problem: single vs. multispeaker approaches

As mentioned previously there are two primary approaches to video dubbing—structural and end-to-end. In order to train a model to generate highly photorealistic talking head videos with current end-to-end methods, many dozens of hours of single-speaker audiovisual content are required. The content should be of a high quality with factors such as good lighting, consistent framing of the face, and clear audio data. The quantity of data on an individual speaker may be reduced when methods are trained on a multi-speaker dataset, but sufficiently large datasets are only starting to become available. At this point in time it is not possible to estimate how well end-to-end methods might generalize to multiple speakers, or how much data may eventually be required to fine-tune a dubbing model for an individual actor in a movie to achieve a realistic mimicry of their facial actions. The goal should be of the order of tens of minutes of data, or less, to allow for the dubbing of the majority of characters with speaking roles.
Talking head generation models that excel in capturing these
universal attributes inherently possess the ability to generate lip
5.3 Generalisation and robustness motions that align with a range of linguistic expressions, irrespective
of language.
Developing a model that can generalize across all faces, and While the lip movements generated by models trained on
audios, under any conditions such as poor lighting, partial occlusion, English-language datasets may exhibit a remarkable degree of
or incorrect framing, remains a challenging task yet to be fully fidelity when applied to unseen languages, capturing cultural
resolved. behaviors associated with those languages is a more intricate
While supervised learning has proven to be a powerful approach endeavor. Cultural gestures, expressions, and head movements
for training models, it typically requires large amounts of labeled often bear an intimate connection with language and its subtle
data that are representative of the target distribution. However, intricacies. Unfortunately, these models, despite their linguistic
collecting diverse and balanced datasets that cover all possible adaptability, may lack the exposure needed to capture these
scenarios and variations in facial appearance and conditions is a culturally specific behaviors accurately. For instance, behaviors
challenging and time-consuming task. Furthermore, it is difficult to like the distinctive head movements indicative of agreement in
anticipate all possible variations that the model may encounter certain cultures remain a challenge for these models. This
during inference, such as changes in lighting conditions or facial underscores the connection between language and culture,
expressions. highlighting the need for models to not only decipher linguistic
To address these challenges, researchers have explored components but also to appreciate and simulate the cultural nuances
alternative approaches such as self-supervised learning, which that accompany them. As such, we believe that further research is
aims to learn from unlabelled data by creating supervisory necessitated to ensure a unified representation of both linguistic and
signals from the data itself. In other words, self-labelling the data. cultural dimensions in the realm of talking head generation and
Methods such as Baevski et al. (2020); Hsu et al. (2021), which fall automatic dubbing, leaving this an open challenge to the field.
5.5 Ethical and legal challenges

Lastly, we note that the modification of original digital media content is subject to a wide range of ethical and data-protection considerations. While for most digital content the work of paid actors is treated as “work for hire,” there are broader considerations if auto-dubbing technology becomes widely adopted. Even as we write, there is a large-scale strike of actors in Hollywood, fighting for rights with respect to the use of AI-generated acting sequences. A full discussion of the broad ethical and intellectual-property implications that will arise as today’s AI technologies mature into sophisticated end-products for digital content creation would require a separate article.

Ultimately, there is a clear need for advanced IP rights management within the digital media creation industry. Past efforts have focused on media-manipulation techniques such as fingerprinting or encryption (Kundur and Karthik, 2004), but these were ultimately unsuccessful. More recently, researchers have proposed that techniques such as blockchain might be used in the context of subtitles (Orero and Torner, 2023), while legal researchers have provided a broader context for the challenge of digital copyright in the evolution of the Metaverse (Jain and Srivastava, 2022). Clearly, multi-lingual video dubbing represents just one specific sub-context of this broader ethical and regulatory challenge.

Looking at ethical considerations for the focused topic of multi-lingual video dubbing, one practical approach is to adopt a methodology that can track pipeline usage. One technique adopted in the literature is to build traceability into the pipeline itself, as discussed by Pataranutaporn et al. (2021). These authors included both human and machine traceability methods in their pipeline to ensure its safe and ethical use. Their human traceability technique was inspired by fabrication-detection techniques drawn from other media paradigms (e.g., text, video) and incorporates perceivable traces, such as signatures of authorship, a distinguishable appearance, or small editing artefacts, into the generated media. Machine traceability, on the other hand, involves embedding traces that are imperceptible to humans, such as non-visible noise signals, into the generated content.
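As a concrete illustration of machine traceability, the sketch below embeds and then detects a keyed, low-amplitude noise watermark in a generated video frame. This is a simplified, hypothetical example of the general idea of imperceptible traces, not the specific method used by Pataranutaporn et al. (2021); the keyed-noise scheme, amplitude, and detection threshold are illustrative assumptions.

# Illustrative sketch of machine traceability via an imperceptible noise watermark.
# The keyed-noise scheme, amplitude, and threshold are assumptions for demonstration;
# production systems would rely on far more robust watermarking techniques.
import numpy as np

def embed_watermark(frame: np.ndarray, key: int, amplitude: float = 2.0) -> np.ndarray:
    """Add low-amplitude pseudo-random noise (derived from `key`) to a uint8 frame."""
    rng = np.random.default_rng(key)
    noise = rng.choice([-1.0, 1.0], size=frame.shape)        # +/-1 pattern tied to the key
    marked = frame.astype(np.float32) + amplitude * noise    # visually imperceptible shift
    return np.clip(marked, 0, 255).astype(np.uint8)

def detect_watermark(frame: np.ndarray, key: int, threshold: float = 1.0) -> bool:
    """Correlate the frame against the keyed noise pattern to test for the mark."""
    rng = np.random.default_rng(key)
    noise = rng.choice([-1.0, 1.0], size=frame.shape)
    residual = frame.astype(np.float32) - frame.mean()       # crude de-biasing
    score = float(np.mean(residual * noise))                 # correlation with the pattern
    return score > threshold

if __name__ == "__main__":
    frame = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)   # stand-in generated frame
    marked = embed_watermark(frame, key=42)
    print("marked frame flagged:", detect_watermark(marked, key=42))   # expected: True
    print("clean frame flagged:", detect_watermark(frame, key=42))     # expected: False

The design choice here is deliberately minimal: the watermark only survives because the detector knows the secret key, and any party with that key can later verify whether a piece of content passed through the dubbing pipeline.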
6 Concluding thoughts

In this paper we have attempted to capture the current state-of-the-art for automated, multi-lingual video dubbing. This is an emerging field of research, driven by the needs of the video streaming industry, and there are many interesting synergies with a range of neural technologies, including auto-translation services, text-to-speech synthesis, and talking-head generators. In addition to a review and discussion of the recent literature, we have also outlined some of the key challenges that remain in blending today’s neural technologies into practical implementations of tomorrow’s digital media services.

This work may serve both as an introduction and reference guide for researchers new to the fields of automatic dubbing and talking head generation, but it also seeks to draw attention to the latest techniques, approaches, and methodologies for those who already have some familiarity with the field. We hope it will encourage and inspire new research and innovation on this emerging research topic.

Author contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This work has the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224, and the ADAPT Centre (Grant 13/RC/2106).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Afouras, T., Chung, J. S., and Zisserman, A. (2018). Lrs3-ted: a large-scale dataset for visual speech recognition. https://arxiv.org/abs/1809.00496.
Alarcon, N. (2023). Netflix builds proof-of-concept AI model to simplify subtitles for translation.
Assael, Y. M., Shillingford, B., Whiteson, S., and De Freitas, N. (2016). Lipnet: end-to-end sentence-level lipreading. https://arxiv.org/abs/1611.01599.
Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: a framework for self-supervised learning of speech representations. Adv. neural Inf. Process. Syst. 33, 12449–12460. doi:10.48550/arXiv.2006.11477
Bigioi, D., Basak, S., Jordan, H., McDonnell, R., and Corcoran, P. (2023). Speech driven video editing via an audio-conditioned diffusion model. https://arxiv.org/abs/2301.04474.
Bigioi, D., Jordan, H., Jain, R., McDonnell, R., and Corcoran, P. (2022). Pose-aware speech driven facial landmark animation pipeline for automated dubbing. IEEE Access 10, 133357–133369. doi:10.1109/ACCESS.2022.3231137
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). Iemocap: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42, 335–359. doi:10.1007/s10579-008-9076-6
Cao, H., Cooper, D. G., Keutmann, M. K., Gur, R. C., Nenkova, A., and Verma, R. (2014). Crema-d: crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 5, 377–390. doi:10.1109/TAFFC.2014.2336244
Cao, Y., Tien, W. C., Faloutsos, P., and Pighin, F. (2005). Expressive speech-driven facial animation. ACM Trans. Graph. (TOG) 24, 1283–1302. doi:10.1145/1095878.1095881
Chen, L., Cui, G., Kou, Z., Zheng, H., and Xu, C. “What comprises a good talking-head video generation?,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Glasgow, UK, 2020a.
Chen, L., Cui, G., Liu, C., Li, Z., Kou, Z., Xu, Y., et al. “Talking-head generation with rhythmic head motion,” in Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 2020b.
Chen, L., Li, Z., Maddox, R. K., Duan, Z., and Xu, C. (2018). Lip movements generation at a glance. https://arxiv.org/abs/1803.10404.
Chen, L., Maddox, R. K., Duan, Z., and Xu, C. (2019). Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. https://arxiv.org/abs/1905.03820.
Chung, J. S., Jamaludin, A., and Zisserman, A. (2017). You said that? https://arxiv.org/abs/1705.02966.
Chung, J. S., Nagrani, A., and Zisserman, A. (2018). Voxceleb2: deep speaker recognition. https://arxiv.org/abs/1806.05622.
Chung, J. S., and Zisserman, A. “Lip reading in the wild,” in Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 2017a, 87–103.
Chung, J. S., and Zisserman, A. “Out of time: automated lip sync in the wild,” in Proceedings of the Computer Vision–ACCV 2016 Workshops: ACCV 2016 International Workshops, Taipei, Taiwan, November 2017b, 251–263.
Cooke, M., Barker, J., Cunningham, S., and Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 120, 2421–2424. doi:10.1121/1.2229005
Cudeiro, D., Bolkart, T., Laidlaw, C., Ranjan, A., and Black, M. J. (2019). Capture, learning, and synthesis of 3d speaking styles. https://arxiv.org/abs/1905.03079.
Das, D., Biswas, S., Sinha, S., and Bhowmick, B. “Speech-driven facial animation using cascaded gans for learning of motion and texture,” in Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 2020, 408–424.
Dhariwal, P., and Nichol, A. (2021). Diffusion models beat gans on image synthesis. Adv. neural Inf. Process. Syst. 34, 8780–8794. doi:10.48550/arXiv.2105.05233
Du, C., Chen, Q., He, T., Tan, X., Chen, X., Yu, K., et al. (2023). Dae-talker: high fidelity speech-driven talking face generation with diffusion autoencoder. https://arxiv.org/abs/2303.17550.
Duquenne, P.-A., Elsahar, H., Gong, H., Heffernan, K., Hoffman, J., Klaiber, C., et al. (2023). SeamlessM4T—massively multilingual and multimodal machine translation. Menlo Park, California, United States: Meta.
Edwards, P., Landreth, C., Fiume, E., and Singh, K. (2016). Jali: an animator-centric viseme model for expressive lip synchronization. ACM Trans. Graph. (TOG) 35, 1–11. doi:10.1145/2897824.2925984
Ekman, P., and Friesen, W. V. (1978). Facial action coding system. Environ. Psychol. Nonverbal Behav. doi:10.1037/t27734-000
Eskimez, S. E., Maddox, R. K., Xu, C., and Duan, Z. “End-to-end generation of talking faces from noisy speech,” in Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, May 2020, 1948–1952.
Eskimez, S. E., Maddox, R. K., Xu, C., and Duan, Z. “Generating talking face landmarks from speech,” in Proceedings of the Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018, Guildford, UK, July 2018, 372–381.
Fried, O., Tewari, A., Zollhöfer, M., Finkelstein, A., Shechtman, E., Goldman, D. B., et al. (2019). Text-based editing of talking-head video. ACM Trans. Graph. (TOG) 38, 1–14. doi:10.1145/3306346.3323028
Gao, Y., Zhou, Y., Wang, J., Li, X., Ming, X., and Lu, Y. (2023). High-fidelity and freely controllable talking head video generation. https://arxiv.org/abs/2304.10168.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. Adv. neural Inf. Process. Syst. 27.
Guo, Y., Chen, K., Liang, S., Liu, Y.-J., Bao, H., and Zhang, J. (2021). Ad-nerf: audio driven neural radiance fields for talking head synthesis. https://arxiv.org/abs/2103.11078.
Harte, N., and Gillen, E. (2015). Tcd-timit: an audio-visual corpus of continuous speech. IEEE Trans. Multimedia 17, 603–615. doi:10.1109/TMM.2015.2407694
Hayes, L., and Bolanos-Garcia-Escribano, A. (2022). Streaming English dubs: a snapshot of Netflix’s playbook. Transtextual and Transcultural Circumnavigations: 10th International Conference of AIETI (Iberian Association for Translation and Interpreting Studies). Braga, Portugal: Universidade do Minho.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising diffusion probabilistic models. Adv. neural Inf. Process. Syst. 33, 6840–6851.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., and Mohamed, A. (2021). Hubert: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio, Speech, Lang. Process. 29, 3451–3460. doi:10.1109/TASLP.2021.3122291
Jain, S., and Srivastava, A. (2022). Copyright infringement in the era of digital world. Int’l JL Mgmt. Hum. 5, 1333.
Ji, X., Zhou, H., Wang, K., Wu, W., Loy, C. C., Cao, X., et al. (2021). Audio-driven emotional video portraits. https://arxiv.org/abs/2104.07452.
Karras, T., Aila, T., Laine, S., Herva, A., and Lehtinen, J. (2017). Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graph. (TOG) 36, 1–12. doi:10.1145/3072959.3073658
Kumar, N., Goel, S., Narang, A., and Hasan, M. (2020). Robust one shot audio to video generation. https://arxiv.org/abs/2012.07842.
Kundur, D., and Karthik, K. (2004). Video fingerprinting and encryption principles for digital rights management. Proc. IEEE 92, 918–932. doi:10.1109/JPROC.2004.827356
Lahiri, A., Kwatra, V., Frueh, C., Lewis, J., and Bregler, C. (2021). Lipsync3d: data-efficient learning of personalized 3d talking faces from video using pose and lighting normalization. https://arxiv.org/abs/2106.04185.
Łańcucki, A. “Fastpitch: parallel text-to-speech with pitch prediction,” in Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, June 2021.
Li, L., Wang, S., Zhang, Z., Ding, Y., Zheng, Y., Yu, X., et al. (2021). Write-a-speaker: text-based emotional and rhythmic talking-head generation. Proc. AAAI Conf. Artif. Intell. 35, 1911–1920. doi:10.1609/aaai.v35i3.16286
Liang, B., Pan, Y., Guo, Z., Zhou, H., Hong, Z., Han, X., et al. “Expressive talking head generation with granular audio-visual control,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 2022, 3387–3396.
Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., et al. (2023). Audioldm: text-to-audio generation with latent diffusion models. https://arxiv.org/abs/2301.12503.
Lu, Y., Chai, J., and Cao, X. (2021). Live speech portraits: real-time photorealistic talking-head animation. ACM Trans. Graph. (TOG) 40, 1–17. doi:10.1145/3478513.3480484
Ma, Y., Wang, S., Hu, Z., Fan, C., Lv, T., Ding, Y., et al. (2023). Styletalk: one-shot talking head generation with controllable speaking styles. https://arxiv.org/abs/2301.01081.
Mariooryad, S., and Busso, C. (2012). Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans. Audio, Speech, Lang. Process. 20, 2329–2340. doi:10.1109/TASL.2012.2201476
Mittal, G., and Wang, B. “Animating face using disentangled audio representations,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, March 2020, 3290–3298.
Nagrani, A., Chung, J. S., and Zisserman, A. (2017). Voxceleb: a large-scale speaker identification dataset. https://arxiv.org/abs/1706.08612.
Narvekar, N. D., and Karam, L. J. (2011). A no-reference image blur metric based on the cumulative probability of blur detection (cpbd). IEEE Trans. Image Process. 20, 2678–2683. doi:10.1109/TIP.2011.2131660
Nichol, A. Q., and Dhariwal, P. “Improved denoising diffusion probabilistic models,” in Proceedings of the International Conference on Machine Learning (PMLR), 2021, 8162–8171.
Nilesh, C., and Deck, A. (2023). Forget subtitles: youtube now dubs videos with AI-generated voices. https://restofworld.org/2023/youtube-ai-dubbing-automated-translation/.
Oh, T.-H., Dekel, T., Kim, C., Mosseri, I., Freeman, W. T., Rubinstein, M., et al. (2019). Speech2face: learning the face behind a voice. https://arxiv.org/abs/1905.09773.
Orero, P., Torner, A. F., et al. (2023). The visible subtitler: blockchain technology towards right management and minting. Open Res. Eur. 3, 26. doi:10.12688/openreseurope.15166.1
Pataranutaporn, P., Danry, V., Leong, J., Punpongsanon, P., Novy, D., Maes, P., et al. (2021). Ai-generated characters for supporting personalized learning and well-being. Nat. Mach. Intell. 3, 1013–1022. doi:10.1038/s42256-021-00417-9
Prajwal, K., Mukhopadhyay, R., Namboodiri, V. P., and Jawahar, C. “A lip sync expert is all you need for speech to lip generation in the wild,” in Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020, 484–492.
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. (2022). Robust speech recognition via large-scale weak supervision. https://arxiv.org/abs/2212.04356.
Richard, A., Zollhöfer, M., Wen, Y., De la Torre, F., and Sheikh, Y. (2021). Meshtalk: 3d face animation from speech using cross-modality disentanglement. https://arxiv.org/abs/2104.08223.
Roxborough, S. (2019). Netflix’s global reach sparks dubbing revolution: “the public demands it”. Los Angeles, California, United States: The Hollywood Reporter.
Shen, S., Zhao, W., Meng, Z., Li, W., Zhu, Z., Zhou, J., et al. (2023). Difftalk: crafting diffusion models for generalized audio-driven portraits animation. https://arxiv.org/abs/2301.03786.
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proceedings of the International Conference on Machine Learning (PMLR), Lille, France, July 2015, 2256–2265.
Song, L., Liu, B., Yin, G., Dong, X., Zhang, Y., and Bai, J.-X. “Tacr-net: editing on deep video and voice portraits,” in Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, October 2021, 478–486.
Song, L., Wu, W., Qian, C., He, R., and Loy, C. C. (2022). Everybody’s talkin’: let me talk as you want. IEEE Trans. Inf. Forensics Secur. 17, 585–598. doi:10.1109/TIFS.2022.3146783
Song, Y., Zhu, J., Li, D., Wang, X., and Qi, H. (2018). Talking face generation by conditional recurrent adversarial network. https://arxiv.org/abs/1804.04786.
Spiteri Miggiani, G. (2021). English-language dubbing: challenges and quality standards of an emerging localisation trend. J. Specialised Transl.
Stypułkowski, M., Vougioukas, K., He, S., Zieba, M., Petridis, S., and Pantic, M. (2023). Diffused heads: diffusion models beat gans on talking-face generation. https://arxiv.org/abs/2301.03396.
Suwajanakorn, S., Seitz, S. M., and Kemelmacher-Shlizerman, I. (2017). Synthesizing obama: learning lip sync from audio. ACM Trans. Graph. (TOG) 36, 1–13. doi:10.1145/3072959.3073640
Taylor, S., Kim, T., Yue, Y., Mahler, M., Krahe, J., Rodriguez, A. G., et al. (2017). A deep learning approach for generalized speech animation. ACM Trans. Graph. (TOG) 36, 1–11. doi:10.1145/3072959.3073699
Thies, J., Elgharib, M., Tewari, A., Theobalt, C., and Nießner, M. “Neural voice puppetry: audio-driven facial reenactment,” in Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 2020, 716–731.
Tulyakov, S., Liu, M.-Y., Yang, X., and Kautz, J. “Mocogan: decomposing motion and content for video generation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, June 2018, 1526–1535.
Vougioukas, K., Petridis, S., and Pantic, M. (2020). Realistic speech-driven facial animation with gans. Int. J. Comput. Vis. 128, 1398–1413. doi:10.1007/s11263-019-01251-8
Wang, K., Wu, Q., Song, L., Yang, Z., Wu, W., Qian, C., et al. “Mead: a large-scale audio-visual dataset for emotional talking-face generation,” in Proceedings of the ECCV, Glasgow, UK, August 2020.
Wang, S., Li, L., Ding, Y., Fan, C., and Yu, X. (2021). Audio2head: audio-driven one-shot talking-head generation with natural head motion. https://arxiv.org/abs/2107.09293.
Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., et al. (2017). Tacotron: towards end-to-end speech synthesis. https://arxiv.org/abs/1703.10135.
Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612. doi:10.1109/TIP.2003.819861
Weitzman, C. (2023). Voice actor vs. AI voice: pros and cons. St Petersburg, Florida, USA: Speechify.
Wen, X., Wang, M., Richardt, C., Chen, Z.-Y., and Hu, S.-M. (2020). Photorealistic audio-driven video portraits. IEEE Trans. Vis. Comput. Graph. 26, 3457–3466. doi:10.1109/TVCG.2020.3023573
Wu, H., Jia, J., Wang, H., Dou, Y., Duan, C., and Deng, Q. “Imitating arbitrary talking style for realistic audio-driven talking face synthesis,” in Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, October 2021, 1478–1486.
Xu, C., Zhu, S., Zhu, J., Huang, T., Zhang, J., Tai, Y., et al. (2023). Multimodal-driven talking face generation, face swapping, diffusion model. https://arxiv.org/abs/2305.02594.
Yang, Y., Shillingford, B., Assael, Y., Wang, M., Liu, W., Chen, Y., et al. (2020). Large-scale multilingual audio visual dubbing. https://arxiv.org/abs/2011.03530.
Yao, X., Fried, O., Fatahalian, K., and Agrawala, M. (2021). Iterative text-based editing of talking-heads using neural retargeting. ACM Trans. Graph. (TOG) 40, 1–14. doi:10.1145/3449063
Yi, R., Ye, Z., Zhang, J., Bao, H., and Liu, Y.-J. (2020). Audio-driven talking face video generation with learning-based personalized head pose. https://arxiv.org/abs/2002.10137.
Zhang, C., Zhao, Y., Huang, Y., Zeng, M., Ni, S., Budagavi, M., et al. (2021a). Facial: synthesizing dynamic talking face with implicit attribute learning. https://arxiv.org/abs/2108.07938.
Zhang, W., Cun, X., Wang, X., Zhang, Y., Shen, X., Guo, Y., et al. (2023). Sadtalker: learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation. https://arxiv.org/abs/2211.12194.
Zhang, X., Wang, J., Cheng, N., Xiao, E., and Xiao, J. (2022). “Shallow diffusion motion model for talking face generation from speech,” in Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data (Berlin, Germany: Springer), 144–157.
Zhang, Z., Li, L., Ding, Y., and Fan, C. (2021b). Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Glasgow, UK, 3661–3670.
Zhou, H., Liu, Y., Liu, Z., Luo, P., and Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. Proc. AAAI Conf. Artif. Intell. 33, 9299–9306. doi:10.48550/arXiv.1807.07860
Zhou, H., Sun, Y., Wu, W., Loy, C. C., Wang, X., and Liu, Z. (2021). Pose-controllable talking face generation by implicitly modularized audio-visual representation. https://arxiv.org/abs/2104.11116.
Zhou, Y., Han, X., Shechtman, E., Echevarria, J., Kalogerakis, E., and Li, D. (2020). Makeittalk: speaker-aware talking-head animation. ACM Trans. Graph. (TOG) 39, 1–15. doi:10.1145/3414685.3417774
Zhou, Y., Xu, Z., Landreth, C., Kalogerakis, E., Maji, S., and Singh, K. (2018). Visemenet: audio-driven animator-centric speech animation. ACM Trans. Graph. (TOG) 37, 1–10. doi:10.1145/3197517.3201292
Zhu, H., Wu, W., Zhu, W., Jiang, L., Tang, S., Zhang, L., et al. (2022). CelebV-HQ: a large-scale video facial attributes dataset. https://arxiv.org/abs/2207.12393.
Zhu, Y., Zhang, C., Liu, Q., and Zhou, X. “Audio-driven talking head video generation with diffusion model,” in Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, June 2023, 1–5.