MakeItTalk: Speaker-Aware Talking-Head Animation
Fig. 1. Given an audio speech signal and a single portrait image as input (left), our model generates speaker-aware talking-head animations (right). Neither the speech signal nor the input face image is observed during the model training process. Our method creates both non-photorealistic cartoon animations (top) and natural human face videos (bottom). Please also see our supplementary video. Cartoon character Wilk ©Dave Werner at Adobe Research. Natural face James Stewart by studio publicity still (public domain).
We present a method that generates expressive talking-head videos from a single facial image with audio as the only input. In contrast to previous attempts to learn direct mappings from audio to raw pixels for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting the speaker-aware dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework, including artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures. In addition, our method generalizes well for faces and characters that were not observed during training. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking-heads of significantly higher quality compared to prior state-of-the-art methods.¹

CCS Concepts: • Computing methodologies → Animation; Machine learning approaches.

Additional Key Words and Phrases: Facial Animation, Neural Networks

ACM Reference Format:
Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeItTalk: Speaker-Aware Talking-Head Animation. ACM Trans. Graph. 39, 6, Article 221 (December 2020), 15 pages. https://doi.org/10.1145/3414685.3417774

Authors' addresses: Yang Zhou, yangzhou@cs.umass.edu, University of Massachusetts Amherst; Xintong Han, hixintonghan@gmail.com, Huya Inc; Eli Shechtman, elishe@adobe.com, Adobe Research; Jose Echevarria, echevarr@adobe.com, Adobe Research; Evangelos Kalogerakis, kalo@cs.umass.edu, University of Massachusetts Amherst; Dingzeyu Li, dinli@adobe.com, Adobe Research.

© 2020 Association for Computing Machinery. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Graphics, https://doi.org/10.1145/3414685.3417774.

¹ Our project page with source code, datasets, and supplementary video is available at https://people.umass.edu/yangzhou/MakeItTalk/

1 INTRODUCTION
Animating expressive talking-heads is essential for filmmaking, virtual avatars, video streaming, computer games, and mixed realities. Despite recent advances, generating realistic facial animation with little or no manual labor remains an open challenge in computer graphics. Several key factors contribute to this challenge. Traditionally, the synchronization between speech and facial movement is hard to achieve manually. Facial dynamics lie on a high-dimensional manifold, making it nontrivial to find a mapping from audio/speech [Edwards et al. 2016]. Secondly, different talking styles in multiple talking-heads can convey different personalities and lead to better
viewing experiences [Walker et al. 1997]. Last but not least, lip syncing and facial animation alone are not sufficient for the perception of realism of talking-heads. The entire facial expression, considering the correlation between all facial elements and head pose, also plays an important role [Faigin 2012; Greenwood et al. 2018]. These correlations, however, are less constrained by the audio and thus hard to estimate.

In this paper, we propose a new method based on a deep neural architecture, called MakeItTalk, to address the above challenges. Our method generates talking-heads from a single facial image and audio as the only input. At test time, MakeItTalk is able to produce plausible talking-head animations with both facial expressions and head motions for new faces and voices not observed during training. Mapping audio to facial animation is challenging, since it is not a one-to-one mapping. Different speakers can have large variations in head pose and expressions given the same audio content. The key insight of our approach is to disentangle the speech content and speaker identity information in the input audio signal. The content captures the phonetic and prosodic information in the input audio and is used for robust synchronization of lips and nearby facial regions. The speaker information captures the rest of the facial expressions and head motion dynamics, which tend to be characteristic for the speaker and are important for generating expressive talking-head animation. We demonstrate that this disentanglement leads to significantly more plausible and believable head animations. To deal with the additional challenge of producing coherent head motion, we propose a combination of LSTM and self-attention mechanisms to capture both short- and long-range temporal dependencies in head pose.

Another key component of our method is the prediction of facial landmarks as an intermediate representation incorporating speaker-aware dynamics. This is in contrast to previous approaches that attempt to directly generate raw pixels or 3D morphable face models from audio. Leveraging facial landmarks as the intermediate representation between audio and visual animation has several advantages. First, based on our disentangled representations, our model learns to generate landmarks that capture subtle, speaker-dependent dynamics, sidestepping the learning of low-level pixel appearance that tends to miss those. The degrees of freedom (DoFs) for landmarks are on the order of tens (68 in our implementation), as opposed to millions of pixels in raw video generation methods. As a result, our learned model is also compact, making it possible to train it from moderately sized datasets. Last but not least, the landmarks can easily be used to drive a wide range of different types of animation content, including human face images and non-photorealistic cartoon images, such as sketches, 2D cartoon characters, Japanese mangas and stylized caricatures. In contrast, existing video synthesis methods and morphable models are limited to human faces and cannot readily generalize to non-photorealistic or non-human faces and expressions.

Our experiments demonstrate that our method achieves significantly more accurate and plausible talking heads compared to prior work qualitatively and quantitatively, especially in the challenging setting of animating static face images unseen during training. In addition, our ablation study demonstrates the advantages of disentangling speech content and speaker identity for speaker-aware facial animation.

In summary, given an audio signal and a single portrait image as input (both unseen during training), our method generates expressive talking-head animations. We highlight the following contributions:

• We introduce a new deep-learning based architecture to predict facial landmarks, capturing both facial expressions and overall head poses, from only speech signals.
• We generate speaker-aware talking-head animations based on disentangled speech content and speaker information, inspired by advances from voice conversion.
• We present two landmark-driven image synthesis methods for non-photorealistic cartoon images and human face images. These methods can handle new faces and cartoon characters not observed during training.
• We propose a set of quantitative metrics and conduct user studies for the evaluation of talking-head animation methods.

2 RELATED WORK
In computer graphics, we have a long history of cross-modal synthesis. More than two decades ago, Brand [1999] pioneered Voice Puppetry to generate full facial animation from an audio track. Music-driven dance generation from Shiratori et al. [2006] matched the rhythm and intensity of the input melody to full-body dance motions. Gan et al. [2020a,b] separate sound sources and generate synchronized music from videos of people playing instruments. Recently, Ginosar et al. [2019] predicted conversational gestures and skeleton movements from speech signals. We focus on audio-driven facial animation, which can supplement other body and gesture prediction methods. In the following paragraphs, we overview prior work on facial landmark synthesis, facial animation, and video generation. Table 1 summarizes differences of methods that are most related to ours based on a set of key criteria.

Table 1. A comparison of related works across various criteria. "Handle unseen faces" means handling faces (or rigged models and 3D meshes) unobserved during training. "Single target image" means requiring only a single target image instead of a video for talking-head generation.

Method                      | Animation format | Beyond lip animation | Head pose | Speaker-awareness | Handle unseen faces | Single target image
Suwajanakorn et al. [2017]  | image            | ✓                    | ✓         | ×                 | ×                   | ×
Taylor et al. [2017]        | rigged model     | ×                    | ×         | ×                 | ✓                   | −
Karras et al. [2017]        | 3D mesh          | ✓                    | ×         | ×                 | ×                   | −
Zhou et al. [2018]          | rigged model     | ×                    | ×         | ×                 | ×                   | −
Zhou et al. [2019]          | image            | ✓                    | ×         | ×                 | ✓                   | ✓
Vougioukas et al. [2019]    | image            | ✓                    | ×         | ×                 | ×                   | ✓
Chen et al. [2019]          | image            | ✓                    | ×         | ×                 | ✓                   | ✓
Thies et al. [2020]         | image            | ✓                    | ×         | ✓                 | ✓                   | ×
Ours                        | image            | ✓                    | ✓         | ✓                 | ✓                   | ✓
Fig. 2. Pipeline of our method ("MakeItTalk"). Given an input audio signal along with a single portrait image (cartoon or real photo), our method animates the portrait in a speaker-aware fashion driven by disentangled content and speaker embeddings. The animation is driven by intermediate predictions of 3D landmark displacements. The "speech content animation" module maps the disentangled audio content to landmark displacements synchronizing the lip, jaw, and nearby face regions with the input speech. The same set of landmarks is further modulated by the "speaker-aware animation" branch that takes into account the speaker embedding to capture the rest of the facial expressions and head motion dynamics.
Audio-driven facial landmark synthesis. Eskimez et al. [2018, 2019] generated synchronized facial landmarks with robust noise resilience using deep neural networks. Later, Chen et al. [2019] trained decoupled blocks to obtain landmarks first and then generate rasterized videos. Attention masks are used to focus on the most changing parts of the face, especially the lips. Greenwood et al. [2018] jointly learnt facial expressions and head poses in terms of landmarks from a forked bi-directional LSTM network. Most previous audio-to-face-animation work focused on matching speech content and left out style/identity information, since the identity is usually bypassed due to mode collapse or averaging during training. In contrast, our approach disentangles audio content and speaker information, and drives landmarks capturing speaker-dependent dynamics.

Lip-sync facial animation. With the increasing power of GPUs, we have seen prolific progress on end-to-end learning from audio to video frames. Chen et al. [2018] synthesized cropped lip movements for each frame. Chung et al. [2017]; Song et al. [2019]; Vougioukas et al. [2019] generated full natural human face images with GANs or encoder-decoder CNNs. Pham et al. [2018] estimated blendshape parameters. Taylor et al. [2017]; Zhou et al. [2018] demonstrated audio-driven talking portraits for rigged face models; however, the input cartoon models required manual rigging and retargeting, as well as artist interventions for animating the rest of the head beyond lips. Our method is instead able to automatically animate an input portrait and does not require such manual inputs. In addition, the above methods do not capture speaker identity or style. As a result, if the same sentence is spoken by two different voices, they will tend to generate the same facial animation, lacking the dynamics required to make it more expressive and realistic.

"Style"-aware facial head animation. Suwajanakorn et al. [2017] used a re-timing dynamic programming method to reproduce speaker motion dynamics. However, it was specific to a single subject (Obama), and does not generalize to faces other than Obama's. In another earlier work, Liu et al. [2015] used color, depth and audio to reproduce the facial animation of a speaker recorded with an RGBD sensor. However, it does not generalize to other unseen speakers. Cudeiro et al. [2019] attempted to model speaker style in a latent representation. Thies et al. [2020] encoded personal style in static blendshape bases. Both methods, however, focus on lower facial animation, especially lips, and do not predict head pose. More similar to ours, Zhou et al. [2019] learned a joint audio-visual representation to disentangle the identity and content from the image domain. However, their identity information primarily focuses on static facial appearance and not the speaker dynamics. As we demonstrate in §5.5, speaker awareness encompasses many aspects beyond mere static appearance. The individual facial expressions and head movements are both important factors for speaker-aware animations. Our method addresses speaker identity by jointly learning the static appearance and head motion dynamics, to deliver faithfully animated talking-heads.

Warping-based character animation. Fišer et al. [2017] and Averbuch-Elor et al. [2017] demonstrated animation of portraits driven by videos and extracted landmarks of human facial performance. Weng et al. [2019] presented a system for animating the body of an input portrait by fitting a human template, then animated it using motion capture data. In our case, we aim to synthesize facial expressions and head pose from audio alone.

Evaluation metrics. Quantitatively evaluating the learned identity/style is crucial for validation; at the same time, it is nontrivial to set up an appropriate benchmark. Many prior works resorted
to subjective user studies [Cudeiro et al. 2019; Karras et al. 2017]. Agarwal et al. [2019] visualized the style distribution via action units. Existing quantitative metrics primarily focus on pixel-level artifacts, since a majority of the network capacity is used to learn pixel generation rather than high-level dynamics [Chen et al. 2019]. Action units have been proposed as an alternative evaluation measure of expression in the context of GAN-based approaches [Pumarola et al. 2018]. We propose a new collection of metrics to evaluate the high-level dynamics that matter for facial expression and head motion.

Image-to-image translation. Image-to-image translation is a very common approach to modern talking face synthesis and editing. Face2Face and VDub are among the early explorers to demonstrate robust appearance transfer between two talking-head videos [Garrido et al. 2015; Thies et al. 2016]. Later, adversarial training was adopted to improve the quality of the transferred results. For example, Kim et al. [2019] used a cycle-consistency loss to transfer styles and showed promising results on one-to-one transfers. Zakharov et al. [2019] developed a few-shot learning scheme that leveraged landmarks to generate natural human facial animation. Based on these prior works, we also employ an image-to-image translation network to generate natural human talking-head animations. Unlike Zakharov et al. [2019], our model handles generalization to faces unseen during training without the need for fine-tuning. Additionally, we are able to generate non-photorealistic images through an image deformation module.

Disentangled learning. Disentanglement of content and style in audio has been widely studied in the voice conversion community. Without diving into its long history (see [Stylianou 2009] for a detailed survey), here we only discuss recent methods that fit into our deep learning pipeline. Wan et al. [2018] developed Resemblyzer as a speaker identity embedding for verification purposes across different languages. Qian et al. [2019] proposed AutoVC, a few-shot voice conversion method to separate the audio into the speech content and the identity information. As a baseline, we use AutoVC for extracting voice content and Resemblyzer for extracting feature embeddings of speaker identities. We introduce the idea of voice conversion to audio-driven animation and demonstrate the advantages of speaker-aware talking-head generation.

3 METHOD
Overview. As summarized in Figure 2, given an audio clip and a single facial image, our architecture, called "MakeItTalk", generates a speaker-aware talking-head animation synchronized with the audio. In the training phase, we use an off-the-shelf 3D face landmark detector to preprocess the input videos and extract the landmarks [Bulat and Tzimiropoulos 2017]. A baseline model to animate the speech content can be trained from the input audio and the extracted landmarks directly. However, to achieve high-fidelity dynamics, we found that landmarks should instead be predicted from a disentangled content representation and speaker embedding of the input audio signal.

Specifically, we use a voice conversion neural network to disentangle the speech content and identity information [Qian et al. 2019]. The content is speaker-agnostic and captures the general motion of lips and nearby regions (Figure 2, Speech Content Animation, §3.1). The identity of the speaker determines the specifics of the motions and the rest of the talking-head dynamics (Figure 2, Speaker-Aware Animation, §3.2). For example, no matter who speaks the word 'Ha!', the lips are expected to be open, which is speaker-agnostic and only dictated by the content. As for the exact shape and size of the opening, as well as the motion of the nose, eyes and head, these will depend on who speaks the word, i.e., the identity. Conditioned on the content and speaker identity information, our deep model outputs a sequence of predicted landmarks for the given audio.

To generate rasterized images, we discuss two algorithms for landmark-to-image synthesis (§3.3). For non-photorealistic images like paintings or vector art (Fig. 7), we use a simple image warping method based on Delaunay triangulation, inspired by [Averbuch-Elor et al. 2017] (Figure 2, Face Warp). For natural images (Fig. 8), we devised an image-to-image translation network, inspired by pix2pix [Isola et al. 2017], to animate the given human face image with the underlying landmark predictions (Figure 2, Image2Image Translation). Combining all the image frames and input audio together gives us the final talking-head animations. In the following sections, we describe each module of our architecture.

3.1 Speech Content Animation
To extract the speaker-agnostic content representation of the audio, we use the AutoVC encoder from Qian et al. [2019]. The AutoVC network utilizes an LSTM-based encoder that compresses the input audio into a compact representation (bottleneck) trained to abandon the original speaker identity but preserve content. In our case, we extract a content embedding $\mathbf{A} \in \mathbb{R}^{T \times D}$ from the AutoVC network, where $T$ is the total number of input audio frames, and $D$ is the content dimension.

The goal of the content animation component is to map the content embedding $\mathbf{A}$ to facial landmark positions with a neutral style. In our experiments, we found that recurrent networks are much better suited for the task than feedforward networks, since they are designed to capture such sequential dependencies between the audio content and landmarks. We experimented with vanilla RNNs and LSTMs [Graves and Jaitly 2014], and found that LSTMs offered better performance. Specifically, at each frame $t$, the LSTM module takes as input the audio content $\mathbf{A}$ within a window $[t \rightarrow t+\tau]$. We set $\tau = 18$ frames (a window size of 0.3s in our experiments). To animate any input 3D static landmarks $\mathbf{q} \in \mathbb{R}^{68 \times 3}$, extracted using a landmark detector, the output of the LSTM layers is fed into a Multi-Layer Perceptron (MLP) that finally predicts displacements $\Delta\mathbf{q}_t$, which put the input landmarks in motion at each frame.

To summarize, the speech content animation module models sequential dependencies to output landmarks based on the following transformations:

$$\mathbf{c}_t = \mathrm{LSTM}_c\big(\mathbf{A}_{t \rightarrow t+\tau};\ \mathbf{w}_{lstm,c}\big), \qquad (1)$$
$$\Delta\mathbf{q}_t = \mathrm{MLP}_c\big(\mathbf{c}_t, \mathbf{q};\ \mathbf{w}_{mlp,c}\big), \qquad (2)$$
$$\mathbf{p}_t = \mathbf{q} + \Delta\mathbf{q}_t. \qquad (3)$$
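Equations (1)–(3) map almost directly onto a small recurrent module. The following PyTorch sketch is a minimal illustration under stated assumptions (content dimension D, window τ = 18, 68 three-dimensional landmarks, an arbitrary hidden size and layer count); it is not the authors' released implementation.

```python
# Minimal sketch of the speech-content animation module of Eqs. (1)-(3):
# an LSTM over a short window of content features, followed by an MLP that
# predicts landmark displacements added to the static landmarks q.
import torch
import torch.nn as nn

class ContentAnimation(nn.Module):
    def __init__(self, content_dim=80, hidden=256, n_landmarks=68, tau=18):
        super().__init__()
        self.tau = tau
        self.lstm = nn.LSTM(content_dim, hidden, num_layers=3, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + 3 * n_landmarks, 512), nn.ReLU(),
            nn.Linear(512, 3 * n_landmarks))

    def forward(self, A, q):
        # A: (B, T + tau, D) content embedding, q: (B, 68*3) static landmarks.
        B, T_pad, D = A.shape
        T = T_pad - self.tau
        # Window A[t : t + tau] for every frame t (Eq. 1): unfold -> (B, T, tau, D).
        windows = A.unfold(1, self.tau, 1)[:, :T].permute(0, 1, 3, 2)
        c, _ = self.lstm(windows.reshape(B * T, self.tau, D))
        c_t = c[:, -1].reshape(B, T, -1)                 # last hidden state per window
        q_rep = q.unsqueeze(1).expand(B, T, -1)
        dq = self.mlp(torch.cat([c_t, q_rep], dim=-1))   # Eq. (2): displacements
        return q_rep + dq                                # Eq. (3): p_t = q + Δq_t
```

In practice the displacements would be post-processed (e.g., smoothed) and then passed to the face warp or image-to-image translation module described in the overview.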
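The speaker-aware animation branch (§3.2) is described in the overview and the Figure 2 caption as modulating the same landmarks using a speaker embedding, and the introduction mentions a combination of LSTM and self-attention for head-pose dynamics; its full specification is not reproduced in this excerpt. The sketch below is therefore only an assumed realization of that description, written in the same style as the content branch above: the layer sizes, depth, and wiring are guesses, not the authors' architecture.

```python
# Rough sketch (not the authors' code) of a speaker-aware animation branch:
# LSTM plus self-attention over the content window, conditioned on a speaker
# embedding, producing per-frame 3D landmark displacements.
import torch
import torch.nn as nn

class SpeakerAwareBranch(nn.Module):
    def __init__(self, content_dim=80, speaker_dim=256, hidden=256, n_landmarks=68):
        super().__init__()
        self.lstm = nn.LSTM(content_dim + speaker_dim, hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden + 3 * n_landmarks, 512), nn.ReLU(),
            nn.Linear(512, 3 * n_landmarks))

    def forward(self, content, speaker_emb, static_landmarks):
        # content: (B, T, content_dim); speaker_emb: (B, speaker_dim);
        # static_landmarks: (B, 68*3) flattened 3D landmarks of the input portrait.
        B, T, _ = content.shape
        spk = speaker_emb.unsqueeze(1).expand(B, T, -1)       # repeat per frame
        h, _ = self.lstm(torch.cat([content, spk], dim=-1))   # short-range dynamics
        h, _ = self.attn(h, h, h)                             # longer-range dependencies
        q = static_landmarks.unsqueeze(1).expand(B, T, -1)
        return self.mlp(torch.cat([h, q], dim=-1))            # (B, T, 68*3) displacements
```

In the full method, these speaker-dependent displacements would be combined with the content-branch output of Eqs. (1)–(3) before rasterization.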
Fig. 6. Generated talking-head animation gallery for non-photorealistic cartoon faces (left) and natural human faces (right). The corresponding intermediate
facial landmark predictions are also shown on the right-bottom corner of each animation frame. Our method synthesizes not only facial expressions, but also
different head poses. Cartoon Man with hat and Girl with brown hair ©Yang Zhou. Natural face (at right bottom corner) from VoxCeleb2 dataset [Chung et al.
2018] ©Visual Geometry Group (CC BY).
Fig. 7. Our model works for a variety of non-photorealistic (cartoon) portrait images, including artistic paintings, 2D cartoon characters, random sketches, Japanese mangas, stylized caricatures and casual photos. Top row: input cartoon images. Next rows: generated talking face examples by face warping. Please also see our supplementary video. Artistic painting Girl with a pearl earring ©Johannes Vermeer (public domain). Random sketch ©Yang Zhou. Japanese manga ©Gwern Branwen (CC-0). Stylized caricature ©Daichi Ito at Adobe Research.
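The face-warp pathway used for these cartoon results (a Delaunay-triangulation-based warp, per the method overview) can be sketched with scikit-image, whose piecewise-affine transform triangulates the control points internally. The border anchor points and coordinate conventions below are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch (not the authors' code): warp a single portrait so that its
# detected landmarks move to the predicted landmark positions for one frame.
# PiecewiseAffineTransform triangulates the control points (Delaunay) and
# applies one affine map per triangle.
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def warp_portrait(image, src_landmarks, dst_landmarks):
    """image: (H, W, 3) float array; landmarks: (68, 2) arrays in (x, y) order."""
    h, w = image.shape[:2]
    # Pin the image border so the background deforms smoothly instead of tearing
    # (an assumption; the whole image, background included, is warped).
    border = np.array([[0, 0], [w // 2, 0], [w - 1, 0], [0, h // 2], [w - 1, h // 2],
                       [0, h - 1], [w // 2, h - 1], [w - 1, h - 1]], dtype=float)
    src = np.vstack([src_landmarks, border])
    dst = np.vstack([dst_landmarks, border])
    tform = PiecewiseAffineTransform()
    # warp() samples output pixel x from the image at tform(x), so the transform
    # must map *output* (dst) coordinates back to *input* (src) coordinates.
    tform.estimate(dst, src)
    return warp(image, tform, output_shape=(h, w))
```

Running this per frame with the predicted landmark sequence, and muxing the frames with the input audio, yields the kind of cartoon animation shown in Figure 7.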
In the following sections, we discuss more results for generating non-photorealistic animations and natural human facial video, along with qualitative comparisons. Then we present detailed numerical evaluation, an ablation study, and applications of our method.

5.1 Animating Non-Photorealistic Images
Figure 7 shows a gallery of our generated non-photorealistic animations. Each animation is generated from input audio and a single portrait image. The portrait images can be artistic paintings, random sketches, 2D cartoon characters, Japanese mangas, stylized caricatures and casual photos. Despite being only trained on human facial landmarks, our method can successfully generalize to a large
variety of stylized cartoon faces. This is because our method uses landmarks as intermediate representations, and also learns relative landmark displacements instead of absolute positions.

5.2 Animating Human Facial Images
Figure 8 shows synthesized talking-head videos featuring talking people as well as comparisons with state-of-the-art video generation methods [Chen et al. 2019; Vougioukas et al. 2019]. The ground-truth (GT) and our results are cropped to highlight the differences in the lip region (see rows 2 and 5). We notice that the videos generated by Chen et al. [2019]; Vougioukas et al. [2019] predict primarily the lip region on cropped faces and therefore miss the head poses. Vougioukas et al. [2019] does not generalize well to faces unseen during training. Chen et al. [2019] lacks synchronization with the input audio, especially for side-facing portraits (see the red box). Compared to these methods, our method predicts facial expressions more accurately and also captures head motion to some degree (see the green box). On the other hand, to be fair, we also note that our method is not artifact-free: the head motion often distorts the background, which is due to the fact that our network attempts to produce the whole image, without separating foreground from background.

Fig. 8. Comparison with state-of-the-art methods for video generation of human talking-heads (rows: GT, GT cropped, [Vougioukas et al. 2019], [Chen et al. 2019], ours cropped, ours; left: source speaker with slight head motion, right: source speaker with active head motion). The compared methods crop the face and predict primarily the lip region, while ours generates both facial expression and head motion. GT and our results are full faces and are cropped for a better visualization of the lip region. Left example: Chen et al. [2019] has worse lip synchronization for side-faces (see the red box). Right example: our method predicts speaker-aware head pose dynamics (see the green box). Note that the predicted head pose is different than the one in the ground-truth video, but it exhibits similar dynamics that are characteristic for the speaker. Natural faces from VoxCeleb2 dataset [Chung et al. 2018] ©Visual Geometry Group (CC BY).

Even though it is only trained on natural face images, our image translation module can also generate plausible facial animations not only limited to real faces, but also to 2D paintings, pictures of statue heads, or rendered images of 3D models. Please check our supplementary video for results.

Our supplementary video also includes a comparison with the concurrent work by Thies et al. [2020]. Given the same audio and target speaker image at test time, we found that our lip synchronization appears to be more accurate than their method. We also emphasize that our method learns to generate head poses, while in their case the head pose is not explicitly handled or is added back heuristically in a post-processing step (not detailed in their paper). Their synthesized video frames appear sharper than ours, perhaps due to the neural renderer of their 3D face model. However, their method requires additional training on target-specific reference videos of length around 30 hours, while ours animates a single target photo immediately without any retraining. Thus, our network has the distinctive advantage of driving a diverse set of single images for which long training videos are not available. These include static cartoon characters, casual photos, paintings, and sketches.

5.3 Evaluation Protocol
We evaluated MakeItTalk and compared with related methods quantitatively and with user studies. We created a test split from the VoxCeleb2 subset, containing 268 video segments from 67 speakers. The speaker identities were observed during training, however, their test speech and video are different from the training ones.
Each video clip lasts 5 to 30 seconds. Landmarks were extracted using [Bulat and Tzimiropoulos 2017] from test clips and their quality was manually verified. We refer to these as "reference landmarks" and use them in the evaluation metrics explained below.

Evaluation Metrics. To evaluate how well the synthesized landmarks represent accurate lip movements, we use the following metrics (a small computational sketch of the jaw-lip metrics is given after the metric definitions):

• Landmark distance for jaw-lips (D-LL): the average Euclidean distance between predicted facial landmark locations of the jaw and lips and reference ones. The landmark positions are normalized according to the maximum width of the reference lips for each test video clip.
• Landmark velocity difference for jaw-lips (D-VL): the average Euclidean distance between reference landmark velocities of the jaw and lips and predicted ones. Velocity is computed as the difference of landmark locations between consecutive frames. The metric captures differences in first-order jaw-lips dynamics.
• Difference in open mouth area (D-A): the average difference between the area of the predicted mouth shape and the reference one. It is expressed as a percentage of the maximum area of the reference mouth for each test video clip.

To evaluate how well the landmarks produced by our method and others reproduce overall head motion, facial expressions, and their dynamics, we use the following metrics:

• Landmark distance (D-L): the average Euclidean distance between all predicted facial landmark locations and reference ones (normalized by the width of the face).
• Landmark velocity difference (D-V): the average Euclidean distance between reference landmark velocities and predicted ones (again normalized by the width of the face). Velocity is computed as the difference of landmark locations between consecutive frames. This metric serves as an indicator of landmark motion dynamics.
• Head rotation and position difference (D-Rot/Pos): the average difference between the reference and predicted head rotation angles (measured in degrees) and head position (again normalized by the width of the face). The measure indicates head pose differences, like nods and tilts.
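To make the metric definitions concrete, the following NumPy sketch computes the jaw-lip metrics D-LL, D-VL, and D-A for one test clip. The landmark index sets and the exact normalizers are assumptions based on the standard 68-point layout; this is not the authors' released evaluation code.

```python
# Minimal sketch (not the authors' evaluation code) of the jaw-lip metrics.
# pred, ref: (T, 68, 2) landmark trajectories for one test clip (2D for simplicity).
# Index sets are assumptions: lips 48-67, inner lips 60-67, jaw tip 6-10.
import numpy as np

LIP_JAW = list(range(48, 68)) + list(range(6, 11))   # assumed "jaw and lips" subset
INNER_LIP = list(range(60, 68))

def polygon_area(pts):                               # shoelace formula
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def jaw_lip_metrics(pred, ref):
    lip_width = np.max(ref[:, 48:68, 0].max(1) - ref[:, 48:68, 0].min(1))  # max ref lip width
    d_ll = np.linalg.norm(pred[:, LIP_JAW] - ref[:, LIP_JAW], axis=2).mean() / lip_width
    vel_p, vel_r = np.diff(pred[:, LIP_JAW], axis=0), np.diff(ref[:, LIP_JAW], axis=0)
    d_vl = np.linalg.norm(vel_p - vel_r, axis=2).mean() / lip_width
    area_p = np.array([polygon_area(f[INNER_LIP]) for f in pred])
    area_r = np.array([polygon_area(f[INNER_LIP]) for f in ref])
    d_a = np.abs(area_p - area_r).mean() / area_r.max()
    return d_ll, d_vl, d_a                           # report as percentages if desired
```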
5.4 Content Animation Evaluation
We begin our evaluation by comparing MakeItTalk with state-of-the-art methods for synthesis of facial expressions driven by landmarks. Specifically, we compare with Eskimez et al. [2018], Zhou et al. [2018], and Chen et al. [2019]. All these methods attempt to synthesize facial landmarks, but cannot produce head motion. Head movements are either generated procedurally or copied from a source video. Thus, to perform a fair evaluation, we factor out head motion from our method, and focus only on comparing predicted landmarks under an identical "neutral" head pose for all methods. For the purpose of this evaluation, we focus on the lip synchronization metrics (D-LL, D-VL, D-A), and ignore head pose metrics. Quantitatively, Table 2 reports these metrics for the above-mentioned methods and ours. We include our full method, as well as three reduced variants: (a) "Ours (no separation)", where we eliminate the voice conversion module and feed the raw audio features as input directly to the speaker-aware animation branch trained and tested alone; in this manner, there is no separation (disentanglement) between audio content and speaker identity; (b) "Ours (no speaker branch)", where we keep the voice conversion module for disentanglement, but we train and test the speech content branch alone without the speaker-aware branch; and (c) "Ours (no content branch)", where we again perform disentanglement, but we train and test the speaker-aware branch alone without the speech content branch. We discuss these three variants in more detail in our ablation study (Section 5.6). The results show that our method achieves the lowest errors for all measures. In particular, our method has 2× less D-LL error in lip landmark positions compared to Eskimez et al. [2018], and 2.5× less D-LL error compared to Chen et al. [2019].

Table 2. Quantitative comparison of facial landmark predictions of MakeItTalk versus state-of-the-art methods.

Methods                   | D-LL ↓ | D-VL ↓ | D-A ↓
[Zhou et al. 2018]        | 6.2%   | 0.63%  | 15.2%
[Eskimez et al. 2018]     | 4.0%   | 0.42%  | 7.5%
[Chen et al. 2019]        | 5.0%   | 0.41%  | 5.0%
Ours (no separation)      | 2.9%   | 0.64%  | 17.1%
Ours (no speaker branch)  | 2.2%   | 0.29%  | 5.9%
Ours (no content branch)  | 3.1%   | 0.38%  | 10.2%
Ours (full)               | 2.0%   | 0.27%  | 4.2%

Figure 9 shows characteristic examples of facial landmark outputs for the above methods and ours from our test set. Each row shows one output frame. Zhou et al. [2018] is only able to predict the lower part of the face and cannot reproduce closed mouths accurately (see second row). Eskimez et al. [2018] and Chen et al. [2019], on the other hand, tend to favor conservative mouth opening. In particular, Chen et al. [2019] predicts bottom and upper lips that sometimes overlap with each other (see second row, red box). In contrast, our method captures facial expressions that match the reference ones better. Ours can also predict subtle facial expressions, such as lip-corner lifting (see first row, red box).

Fig. 9. Facial expression landmark comparison. Each row shows an example frame prediction for different methods ([Zhou et al. 2018], [Eskimez et al. 2018], [Chen et al. 2019], and ours). The GT landmarks and uttered phonemes (/er/, /m/, /ah/) are shown on the left.

5.5 Speaker-Aware Animation Evaluation
Head Pose Prediction and Speaker Awareness. Existing speech-driven facial animation methods do not synthesize head motion. Instead, a common strategy is to copy head poses from another existing video. Based on this observation, we evaluate our method against two baselines: "retrieve-same ID" and "retrieve-random ID". These baselines retrieve the head pose and position sequence from another video clip randomly picked from our training set. Then the facial landmarks are translated and rotated to reproduce the copied head poses and positions. The first baseline "retrieve-same ID" uses a training video with the same speaker as in the test video. This strategy makes this baseline stronger since it re-uses dynamics from the same speaker. The second baseline "retrieve-random ID" uses a video from a different random speaker. This baseline is useful to examine whether our method and alternatives produce head pose and facial expressions better than random or not.

Table 3. Head pose prediction comparison with the baseline methods in §5.5 and our variants, based on our head pose metrics.

Methods                   | D-L ↓  | D-V ↓ | D-Rot/Pos ↓
retrieve-same ID          | 17.1%  | 1.2%  | 10.3 / 8.1%
retrieve-random ID        | 20.8%  | 1.1%  | 21.4 / 9.2%
Ours (no separation)      | 12.4%  | 1.1%  | 8.8 / 5.4%
Ours (random ID)          | 33.0%  | 2.4%  | 28.7 / 12.3%
Ours (no speaker branch)  | 13.8%  | 1.2%  | 12.6 / 6.9%
Ours (no content branch)  | 12.5%  | 0.9%  | 8.6 / 5.7%
Ours (full)               | 12.3%  | 0.8%  | 8.0 / 5.4%

Table 3 reports the D-L, D-V, and D-Rot/Pos metrics. Our full method achieves much smaller errors compared to both baselines, indicating our speaker-aware prediction is more faithful compared to merely copying head motion from another video. In particular, we observe that our method produces 2.7× less error in head pose (D-Rot), and 1.7× less error in head position (D-Pos) compared to using a random speaker identity (see "retrieve-random ID"). This result also confirms that the head motion dynamics of random speakers largely differ from ground-truth ones. Compared to the stronger baseline of re-using video from the same speaker (see "retrieve-same ID"), we observe that our method still produces 1.3× less error in head pose (D-Rot), and 1.5× less error in head position (D-Pos). This result confirms that re-using head motion from a video clip even from the right speaker still results in significant discrepancies, since the copied head pose and position does not necessarily synchronize well with the audio. Our full method instead captures the head motion dynamics and facial expressions more consistently w.r.t. the input audio and speaker identity.

Figure 6 shows a gallery of our generated cartoon images and natural human faces under different predicted head poses. The corresponding generated facial landmarks are also shown on the right-bottom corner of each image. The demonstrated examples show that our method is able to synthesize head pose well, including nods and swings. Figure 10 shows another qualitative validation of our method's ability to capture personalized head motion dynamics. The figure embeds 8 representative speakers from our dataset based on their variance in Action Units (AUs), head pose and position. The AUs are computed from the predicted landmarks based on the definitions from [Ekman and Friesen 1978]. The embedding is performed through t-SNE [Maaten and Hinton 2008]. These 8 representatives were selected using furthest sampling, i.e., their AUs, head pose and position differ most from the rest of the speakers in our dataset. We use different colors for different speakers, solid dots for embeddings produced based on the reference videos (in AU, head pose and position variance), and stars for embeddings resulting from our method. The visualization demonstrates that our method produces head motion dynamics that tend to be located more closely to the reference ones.

Fig. 10. t-SNE visualization for AU, head pose and position variance based on 8 reference speakers' videos (solid dots) and our predictions (stars). Different speakers (Speaker 0 through Speaker 7) are marked with different colors as shown in the legend.
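The Figure 10 protocol — one descriptor per clip built from the variance of action units, head pose, and head position, embedded with t-SNE — can be sketched as follows. The feature construction below is an assumption about what such per-clip statistics might look like, not the authors' exact features.

```python
# Minimal sketch (not the authors' code) of the Figure 10 style embedding:
# build one descriptor per video clip from the variance of head pose, head
# position, and action-unit-like expression signals, then embed with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

def clip_descriptor(rotations, positions, aus):
    # rotations: (T, 3) Euler angles, positions: (T, 2), aus: (T, K) AU activations
    # (all derived from the predicted or reference landmark sequences).
    return np.concatenate([rotations.var(0), positions.var(0), aus.var(0)])

def embed_clips(descriptors, perplexity=10, seed=0):
    X = np.stack(descriptors)                          # (num_clips, feature_dim)
    return TSNE(n_components=2, perplexity=perplexity,
                random_state=seed).fit_transform(X)    # (num_clips, 2) for plotting
```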
5.6 Ablation Study
Individual branch performance. We performed an ablation study by training and testing the three reduced variants of our network described in §5.4: "Ours (no separation)" (no disentanglement between content and speaker identity), "Ours (no speaker branch)", and "Ours (no content branch)". The aim of the last two variants is to check whether a single network can jointly learn both lip synchronization and speaker-aware head motion. The results of these three variants and our full method are shown in Table 2 and Table 3. We also refer readers to the supplementary video for more visual comparisons.

The variant "Ours (no speaker branch)" only predicts facial landmarks from the audio content without considering the speaker identity. It performs well in terms of capturing the lip landmarks, since these are synchronized with the audio content. The variant is slightly worse than our method based on the lip evaluation metrics (see Table 2). However, it results in 1.6× larger error in head pose and 1.3× larger error in head position (see Table 3), since head motion is a function of both speaker identity and content.

The variant "Ours (no content branch)" has the opposite behaviour: it performs well in terms of capturing head pose and position (it is slightly worse than our method, see Table 3). However, it has 1.6× higher error in jaw-lip landmark difference and 2.4× higher error in open mouth area difference (see Table 2), which indicates that the lower part of the face dynamics are not synchronized well with the audio content. Figure 11 demonstrates that using the speaker-aware animation branch alone, i.e., the "Ours (no content branch)" variant, results in noticeable artifacts in the jaw-lip landmark displacements. Using both branches in our full method offers the best performance according to all evaluation metrics.

Fig. 11. Comparison to the "Ours (no content branch)" variant (right-top), which uses only the speaker-aware animation branch. The full model (right-bottom) result has much better articulation in the lower part of the face (the frames correspond to /w/, /e/, and silence). It demonstrates that a single network architecture cannot jointly learn both lip synchronization and speaker-aware head motion. Audrey Hepburn ©Me Pixels (CC-0).

The results of the variant "Ours (no separation)" are similar to the variant "Ours (no content branch)": it achieves slightly worse head pose performance than our full method (Table 3), and much worse results in terms of lip movement accuracy (Table 2). Specifically, it has 1.5×, 2.4×, and 4.1× higher error in jaw-lip landmark position, velocity, and open mouth area difference, respectively. We hypothesize this is because the content and the speaker identity information are still entangled and therefore it is hard for the network to disambiguate a one-to-one mapping between audio and landmarks.

Random speaker ID injection. We tested one more variant of our method called "Ours (random ID)". For this variant, we use our full network, however, instead of using the correct speaker embedding, we inject another random speaker identity embedding. The result of this variant is shown in Table 3. Again we observe that the performance is significantly worse (3.6× more error for head pose). This indicates that our method successfully splits the content and speaker-aware motion dynamics, and captures the correct speaker head motion dynamics (i.e., it does not reproduce random ones).

5.7 User Studies
We also evaluated our method through perceptual user studies via the Amazon Mechanical Turk service. We obtained 6480 query responses from 324 different MTurk participants in our two different user studies described below.

User study for speaker awareness. Our first study evaluated the speaker awareness of different variants of our method while synthesizing cartoon animations. Specifically, we assembled a pool of 300 queries displayed on different webpages. On top of the webpage, we showed a reference video of a real person talking, and on the bottom a pair of cartoon animations generated by two variants of our method. We then asked: "Which of the two cartoon animations best represents the person's talking style in terms of facial expressions and head motion?" The MTurk participants were asked to pick one of the following choices: "left animation", "right animation", "can't tell - both represent the person quite well", "can't tell - none represent the person well". Each MTurk participant was asked to complete a questionnaire with 20 queries randomly picked from our pool. Queries were shown in a random order. Each query was repeated twice (i.e., we had 10 unique queries per questionnaire), with the two cartoon videos randomly flipped each time to detect unreliable participants giving inconsistent answers. We filtered out unreliable MTurk participants who gave two different answers to more than 5 out of the 10 unique queries in the questionnaire, or took less than a minute to complete it. Each participant was allowed to answer at most one questionnaire to ensure participant diversity. We had 90 different, reliable MTurk participants for this user study. For each of our 300 queries, we got votes from 3 different MTurk participants. Since each MTurk participant voted for 10 unique queries twice, we gathered 1800 responses (300 queries × 3 votes × 2 repetitions) from our 90 MTurk participants. Figure 12 (top) shows the study result. We see that the majority of MTurkers picked our full method more frequently, when compared with either of the two variants.

User study for natural human facial video. To validate our landmark-driven human facial animation method, we conducted one more user study. Each MTurk participant was shown a questionnaire with 20 queries involving random pairwise comparisons out of a pool of 780 queries we generated. For each query, we showed a single frame showing the head of a real person on top, and two generated videos below (randomly placed at left/right positions): one video synthesized by our method, and another by either Vougioukas et al. [2019] or Chen et al. [2019]. The participants were asked which person's facial expression and head motion look more realistic and plausible. We also explicitly instructed them to ignore the particular camera position or zoom factor and focus on the face. Participants were asked to pick one of four choices ("left", "right", "both" or "none") as in the previous study. We also employed the same random and repeated query design and MTurker consistency and reliability checks to filter out unreliable answers. We had 234 different MTurk participants for this user study. Like in the previous study, each query received votes from 3 different, reliable MTurk participants. As a result, we gathered 780 queries × 3 votes × 2 repetitions = 4680 responses from our 234 participants. Figure 12 (bottom) shows the study result. Our method was voted as the most "realistic" and "plausible" by a large majority, when compared to Chen et al. [2019] or Vougioukas et al. [2019].
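The reliability filtering and vote aggregation described above can be expressed as a short helper; the data layout (a dict of repeated answers per query and a completion time) is an illustrative assumption, not the actual study pipeline.

```python
# Minimal sketch of the MTurk reliability filter and vote tally described above:
# each unique query is answered twice; a participant is kept only if at most 5
# of the 10 repeated pairs disagree and the questionnaire took at least a minute.
def reliable(responses, duration_seconds):
    # responses: dict mapping query_id -> list of the two recorded choices.
    inconsistent = sum(1 for answers in responses.values() if len(set(answers)) > 1)
    return inconsistent <= 5 and duration_seconds >= 60

def tally(votes):
    # votes: iterable of choices, e.g. "left", "right", "both", "none".
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}   # fraction per choice
```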
Fig. 12. User study results for speaker awareness (top) and natural human facial animation (bottom). (Response options, shown in the legend: prefer left, none are good, both are good, prefer right.)
5.8 Applications
Synthesizing plausible talking-heads has numerous applications [Kim et al. 2019; Zhou et al. 2019]. One common application is dubbing using the voice of a person different from the one in the original video, or even using a voice in a different language. In Figure 13 (top), given a single frame of the target actor and a dubbing audio spoken by another person, we can generate a video of the target actor talking according to that other person's speech.

Another application is bandwidth-limited video conferencing. In scenarios where the visual frames cannot be delivered with high fidelity and frame rate, we can make use only of the audio signal to drive the talking-head video. The audio signal can be preserved under much lower bandwidth compared to its visual counterpart. Yet, it is still important to preserve visual facial expressions, especially lip motions, since they heavily contribute to understanding in communication [McGurk and MacDonald 1976]. Figure 13 (middle) shows that we can synthesize talking heads with facial expressions and lip motions with only the audio and an initial high-quality user profile image as input. Figure 13 (bottom) shows examples of both natural human and cartoon talking-head animation that can be used in teleconferencing for entertainment reasons, or due to privacy concerns related to video recording. We also refer readers to the supplementary video.

Our supplementary video also demonstrates a text-to-video application, where we synthesize natural human face video from text input, after converting it to audio through a speech synthesizer [Notevibes 2020]. Finally, our video demonstrates the possibility of interactively editing the pose of our synthesized talking heads by applying a rotation to the intermediate landmarks predicted by our network.

Fig. 13. Applications. Top row: video dubbing for a target actor given only audio as input. Middle and bottom rows: video conferencing from a single user profile image (natural face or cartoon) to talking frames. Please also see our supplementary video. Video conference application natural face ©PxHere (CC-0).

6 CONCLUSION
We have introduced a deep learning based approach to generate speaker-aware talking-head animations from an audio clip and a single image. Our method can handle new audio clips and new portrait images not seen during training. Our key insight was to predict landmarks from disentangled audio content and speaker information, such that they capture better lip synchronization, personalized facial expressions and head motion dynamics. This led to much more expressive animations with higher overall quality compared to the state-of-the-art.

Limitations and Future Work. There are still several avenues for future research. Although our method captures aspects of the speaker's style, e.g., predicting head swings reflecting active speech, there are several other factors that can influence head motion dynamics. For example, the speaker's mood can also play a significant role in determining head motion and facial expressions. Further incorporating sentiment analysis into the animation pipeline is a promising research direction.

Our speech content animation currently does not always capture well the bilabial and fricative sounds, i.e., /b/m/p/f/v/. We believe this is caused by the voice conversion module that we used, which tends to miss those short phoneme representations when performing voice spectrum reconstruction. While we observe promising results via directly adopting a state-of-the-art voice conversion architecture, a more domain-specific adaptation for audio-driven animation may address such discrepancies between voice conversion and animation.

Improving the image translation from landmarks to videos can also be further investigated. The current image-to-image translation network takes only the 2D facial landmarks as input. Incorporating more phoneme- or viseme-related features as input may improve the quality of the generated video in terms of articulation. Moreover, background distortion and artifacts are noticeable in our current solution. Our image translation module warps both the background and the foreground head to produce the animation, which gives an impression of a camera motion mixed with head motion. Adapting fore/background separation or portrait matting in the network architecture during training and testing may help generate better
results [Sengupta et al. 2020]. Capturing long-range temporal and spatial dependencies between pixels could further reduce low-level artifacts. Another limitation is that in the case of large head motion, more artifacts tend to appear: since we attempt to create animations from a single input image, large rotations/translations require sufficient extrapolation to unseen parts of the head (e.g., neck, shoulders, hair), which are more challenging for our current image translation net to hallucinate, especially for natural head images.

Our method heavily relies on the intermediate sparse landmark representation to guide the final video output. The representation has the advantage of being low-dimensional and handling a variety of faces beyond human-looking ones. On the other hand, landmarks serve mostly as coarse proxies for modeling heads; thus, for large motion, they sometimes cause face distortions, especially for natural head images (see also our supplementary video, last clip). In the particular case of human face animation, an alternative representation could be denser landmarks or parameters of a morphable model that may result in more accurate face reconstructions. A particular challenge here would be to train such models in the zero-shot learning regime, where the input portrait has not been observed during training; current methods seem to require additional fine-tuning on target faces [Thies et al. 2020].

Finally, our current algorithm focuses on a fully automatic pipeline. It remains an open challenge to incorporate user interaction within a human-in-the-loop approach. An important question is how an animator could edit landmarks in certain frames and propagate those edits to the rest of the video. We look forward to future endeavors on high-quality expressive talking-head animations with intuitive controls.

7 ETHICAL CONSIDERATIONS
"Deepfake videos" are becoming more prevalent in our everyday life. The general public might still think that talking head videos are hard or impossible to generate synthetically. As a result, algorithms for talking head generation can be misused to spread misinformation or for other malicious acts. We hope our code will help people understand that generating such videos is entirely feasible. Our main intention is to spread awareness and demystify this technology. Our code adds a watermark to the generated videos, making it clear that they are synthetic.

ACKNOWLEDGMENTS
We would like to thank Timothy Langlois for the narration, and Kaizhi Qian for the help with the voice conversion module. We thank Daichi Ito for sharing the caricature image and Dave Werner for Wilk, the gruff but ultimately lovable puppet. We also thank the anonymous reviewers for their constructive comments and suggestions. This research is partially funded by NSF (EAGER-1942069) and a gift from Adobe. Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the MassTech Collaborative.

REFERENCES
Shruti Agarwal, Hany Farid, Yuming Gu, Mingming He, Koki Nagano, and Hao Li. 2019. Protecting World Leaders Against Deep Fakes. In Proc. CVPRW.
Hadar Averbuch-Elor, Daniel Cohen-Or, Johannes Kopf, and Michael F Cohen. 2017. Bringing portraits to life. ACM Trans. Graphics (2017).
Thabo Beeler and Derek Bradley. 2014. Rigid stabilization of facial expressions. ACM Trans. Graphics (2014).
Matthew Brand. 1999. Voice puppetry. In Proc. SIGGRAPH.
Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proc. ICCV.
Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2018. Lip movements generation at a glance. In Proc. ECCV.
Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In Proc. CVPR.
Joon Son Chung, Amir Jamaludin, and Andrew Zisserman. 2017. You said that? In Proc. BMVC.
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep Speaker Recognition. In Proc. INTERSPEECH.
Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J Black. 2019. Capture, Learning, and Synthesis of 3D Speaking Styles. In Proc. CVPR.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).
Pif Edwards, Chris Landreth, Eugene Fiume, and Karan Singh. 2016. JALI: an animator-centric viseme model for expressive lip synchronization. ACM Trans. Graphics (2016).
Paul Ekman and Wallace V Friesen. 1978. Facial action coding system: a technique for the measurement of facial movement. (1978).
Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. 2018. Generating talking face landmarks from speech. In Proc. LVA/ICA.
Sefik Emre Eskimez, Ross K Maddox, Chenliang Xu, and Zhiyao Duan. 2019. Noise-Resilient Training Method for Face Landmark Generation From Speech. IEEE/ACM Trans. Audio, Speech, and Language Processing (2019).
Patrick Esser, Ekaterina Sutter, and Björn Ommer. 2018. A variational u-net for conditional appearance and shape generation. In Proc. CVPR.
Gary Faigin. 2012. The artist's complete guide to facial expression. Watson-Guptill.
Jakub Fišer, Ondřej Jamriška, David Simons, Eli Shechtman, Jingwan Lu, Paul Asente, Michal Lukáč, and Daniel Sýkora. 2017. Example-Based Synthesis of Stylized Facial Animations. ACM Trans. Graphics (2017).
Chuang Gan, Deng Huang, Peihao Chen, Joshua B Tenenbaum, and Antonio Torralba. 2020a. Foley Music: Learning to Generate Music from Videos. In Proc. ECCV.
Chuang Gan, Deng Huang, Hang Zhao, Joshua B Tenenbaum, and Antonio Torralba. 2020b. Music Gesture for Visual Sound Separation. In Proc. CVPR. 10478–10487.
Pablo Garrido, Levi Valgaerts, Hamid Sarmadi, Ingmar Steiner, Kiran Varanasi, Patrick Perez, and Christian Theobalt. 2015. VDub: Modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum.
Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, and Jitendra Malik. 2019. Learning Individual Styles of Conversational Gesture. In Proc. CVPR.
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML.
David Greenwood, Iain Matthews, and Stephen Laycock. 2018. Joint learning of facial expression and head pose from speech. In Proc. Interspeech.
Xintong Han, Zuxuan Wu, Weilin Huang, Matthew R Scott, and Larry S Davis. 2019. FiNet: Compatible and Diverse Fashion Image Inpainting. In Proc. ICCV.
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proc. CVPR.
Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proc. ECCV.
Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ACM Trans. Graphics (2017).
Hyeongwoo Kim, Mohamed Elgharib, Michael Zollhöfer, Hans-Peter Seidel, Thabo Beeler, Christian Richardt, and Christian Theobalt. 2019. Neural style-preserving visual dubbing. ACM Trans. Graphics (2019).
Yilong Liu, Feng Xu, Jinxiang Chai, Xin Tong, Lijuan Wang, and Qiang Huo. 2015. Video-audio driven real-time facial animation. ACM Trans. Graphics (2015).
Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR (2008).
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. 2017. Least squares generative adversarial networks. In Proc. ICCV.
Harry McGurk and John MacDonald. 1976. Hearing lips and seeing voices. Nature (1976).
Notevibes. 2020. Text to Speech converter. https://notevibes.com/.
Hai Xuan Pham, Yuting Wang, and Vladimir Pavlovic. 2018. End-to-end learning for 3d facial animation from speech. In Proc. ICMI.
Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. GANimation: Anatomically-aware facial animation from a single image. In Proc. ECCV.
Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. 2019. AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss. In Proc. ICML. 5210–5219.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Proc. MICCAI.
Aleksandr Segal, Dirk Haehnel, and Sebastian Thrun. 2009. Generalized-ICP. In Proc. Robotics: Science and Systems.
Soumyadip Sengupta, Vivek Jayaram, Brian Curless, Steve Seitz, and Ira Kemelmacher-Shlizerman. 2020. Background Matting: The World is Your Green Screen. In Proc. CVPR.
Takaaki Shiratori, Atsushi Nakazawa, and Katsushi Ikeuchi. 2006. Dancing-to-music character animation. In Computer Graphics Forum.
Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First Order Motion Model for Image Animation. In Proc. NeurIPS.
Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. In Proc. ICLR.
Yang Song, Jingwen Zhu, Dawei Li, Andy Wang, and Hairong Qi. 2019. Talking Face Generation by Conditional Recurrent Adversarial Network. In Proc. IJCAI.
Olga Sorkine. 2006. Differential representations for mesh processing. In Computer Graphics Forum.
Yannis Stylianou. 2009. Voice transformation: a survey. In Proc. ICASSP.
Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: learning lip sync from audio. ACM Trans. Graphics (2017).
Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. ACM Trans. Graphics (2017).
Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural Voice Puppetry: Audio-driven Facial Reenactment. In Proc. CVPR, to appear.
Justus Thies, Michael Zollhofer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. 2016. Face2Face: Real-time face capture and reenactment of RGB videos. In Proc. CVPR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proc. NeurIPS.
Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, et al. 2016. Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. (2016).
Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. Realistic Speech-Driven Facial Animation with GANs. IJCV (2019).
Marilyn A Walker, Janet E Cahn, and Stephen J Whittaker. 1997. Improvising linguistic style: Social and affective bases for agent personality. In Proc. IAA.
Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. 2018. Generalized end-to-end loss for speaker verification. In Proc. ICASSP.
Chung-Yi Weng, Brian Curless, and Ira Kemelmacher-Shlizerman. 2019. Photo wake-up: 3d character animation from a single photo. In Proc. CVPR.
Jordan Yaniv, Yael Newman, and Ariel Shamir. 2019. The face of art: landmark detection and geometric style in portraits. ACM Trans. Graphics (2019).
Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. In Proc. ICCV.
Hang Zhou, Yu Liu, Ziwei Liu, Ping Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. In Proc. AAAI.
Yang Zhou, Zhan Xu, Chris Landreth, Evangelos Kalogerakis, Subhransu Maji, and Karan Singh. 2018. VisemeNet: Audio-driven animator-centric speech animation. ACM Trans. Graphics (2018).

B IMAGE-TO-IMAGE TRANSLATION NETWORK
The layers of the network architecture used to generate natural human face images are listed in Table 4. In this table, the left column indicates the spatial resolution of the feature map output. ResBlock down means a 2-strided convolutional layer with a 3×3 kernel followed by two residual blocks, ResBlock up means a nearest-neighbor upsampling with a scale of 2, followed by a 3×3 convolutional layer and then two residual blocks, and Skip means a skip connection concatenating the feature maps of an encoding layer and a decoding layer with the same spatial resolution.

Table 4. Generator architecture for synthesizing natural face images.

Output resolution | Layer
256 × 256         | Input: landmark representation Y_t + input image Q
128 × 128         | ResBlock down (3 + 3) → 64
64 × 64           | ResBlock down 64 → 128
32 × 32           | ResBlock down 128 → 256
16 × 16           | ResBlock down 256 → 512
8 × 8             | ResBlock down 512 → 512
4 × 4             | ResBlock down 512 → 512
8 × 8             | ResBlock up 512 → 512
16 × 16           | Skip + ResBlock up (512 + 512) → 512
32 × 32           | Skip + ResBlock up (512 + 512) → 256
64 × 64           | Skip + ResBlock up (256 + 256) → 128
128 × 128         | Skip + ResBlock up (128 + 128) → 64
256 × 256         | Skip + ResBlock up (64 + 64) → 3
256 × 256         | Tanh
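Table 4 can be transcribed almost directly into a network definition. The PyTorch sketch below follows the listed resolutions, channel counts, and skip connections; the internals of each residual block (normalization and activation choices) are not specified in the appendix and are filled in here as assumptions.

```python
# Minimal sketch of the Table 4 generator (not the authors' released code).
# "ResBlock down" = stride-2 3x3 conv + two residual blocks; "ResBlock up" =
# nearest-neighbor 2x upsample + 3x3 conv + two residual blocks; "Skip" =
# concatenation with the encoder feature map at the same resolution.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1),
            nn.InstanceNorm2d(ch), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def down(cin, cout):   # "ResBlock down"
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         ResBlock(cout), ResBlock(cout))

def up(cin, cout):     # "ResBlock up"
    return nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                         nn.Conv2d(cin, cout, 3, padding=1),
                         ResBlock(cout), ResBlock(cout))

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [64, 128, 256, 512, 512, 512]            # encoder outputs, 128^2 ... 4^2
        self.enc = nn.ModuleList([down(6, 64)] +         # landmark image Y_t (3) + photo Q (3)
                                 [down(chans[i], chans[i + 1]) for i in range(5)])
        self.dec = nn.ModuleList([up(512, 512),           # 4^2 -> 8^2, no skip
                                  up(512 + 512, 512), up(512 + 512, 256),
                                  up(256 + 256, 128), up(128 + 128, 64), up(64 + 64, 3)])

    def forward(self, y, q):                              # y, q: (B, 3, 256, 256)
        feats, x = [], torch.cat([y, q], dim=1)
        for e in self.enc:
            x = e(x)
            feats.append(x)                               # 128^2 ... 4^2 feature maps
        x = self.dec[0](feats[-1])                        # to 8^2
        for d, skip in zip(self.dec[1:], reversed(feats[:-1])):
            x = d(torch.cat([x, skip], dim=1))            # skip connections per Table 4
        return torch.tanh(x)                              # 256x256 RGB output
```

Here the landmark representation Y_t is assumed to be rasterized as a 3-channel 256×256 image so that it can be concatenated with the portrait Q, consistent with the (3 + 3) input channels of the first encoder block.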