3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training


Dario Pavllo* (ETH Zürich)    Christoph Feichtenhofer (Facebook AI Research)    David Grangier* (Google Brain)    Michael Auli (Facebook AI Research)

*Work done while at Facebook AI Research.

arXiv:1811.11742v2 [cs.CV] 29 Mar 2019

Abstract

In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-project to the input 2D keypoints. In the supervised setting, our fully-convolutional model outperforms the previous best result from the literature by 6 mm mean per-joint position error on Human3.6M, corresponding to an error reduction of 11%, and the model also shows significant improvements on HumanEva-I. Moreover, experiments with back-projection show that it comfortably outperforms previous state-of-the-art results in semi-supervised settings where labeled data is scarce. Code and models are available at https://github.com/facebookresearch/VideoPose3D

Figure 1: Our temporal convolutional model takes 2D keypoint sequences (bottom) as input and generates 3D pose estimates as output (top). We employ dilated temporal convolutions to capture long-term information.

1. Introduction

Our work focuses on 3D human pose estimation in video. We build on the approach of state-of-the-art methods which formulate the problem as 2D keypoint detection followed by 3D pose estimation [41, 52, 34, 50, 10, 40, 56, 33]. While splitting up the problem arguably reduces the difficulty of the task, it is inherently ambiguous as multiple 3D poses can map to the same 2D keypoints. Previous work tackled this ambiguity by modeling temporal information with recurrent neural networks [16, 27]. On the other hand, convolutional networks have been very successful in modeling temporal information in tasks that were traditionally tackled with RNNs, such as neural machine translation [11], language modeling [7], speech generation [55], and speech recognition [6]. Convolutional models enable parallel processing of multiple frames which is not possible with recurrent networks.

In this paper, we present a fully convolutional architecture that performs temporal convolutions over 2D keypoints for accurate 3D pose prediction in video (see Figure 1). Our approach is compatible with any 2D keypoint detector and can effectively handle large contexts via dilated convolutions. Compared to approaches relying on RNNs [16, 27], it provides higher accuracy, simplicity, as well as efficiency, both in terms of computational complexity as well as the number of parameters (§3).

Equipped with a highly accurate and efficient architecture, we turn to settings where labeled training data is scarce and introduce a new scheme to leverage unlabeled video data for semi-supervised training. Low-resource settings are particularly challenging for neural network models which require large amounts of labeled training data, and collecting labels for 3D human pose estimation requires an expensive motion capture setup as well as lengthy recording sessions. Our method is inspired by cycle consistency in unsupervised machine translation, where round-trip translation into an intermediate language and back into the original language should be close to the identity function [46, 26, 9]. Specifically, we predict 2D keypoints for an unlabeled video with an off-the-shelf 2D keypoint detector, predict 3D poses, and then map these back to 2D space (§4).
In summary, this paper provides two main contributions. First, we present a simple and efficient approach for 3D human pose estimation in video based on dilated temporal convolutions on 2D keypoint trajectories. We show that our model is more efficient than RNN-based models at the same level of accuracy, both in terms of computational complexity and the number of model parameters.

Second, we introduce a semi-supervised approach which exploits unlabeled video, and is effective when labeled data is scarce. Compared to previous semi-supervised approaches, we only require camera intrinsic parameters rather than ground-truth 2D annotations or multi-view imagery with extrinsic camera parameters.

In comparison to the state of the art, our approach outperforms the previously best performing methods in both supervised and semi-supervised settings. Our supervised model performs better than other models even if these exploit extra labeled data for training.

2. Related work

Before the success of deep learning, most approaches to 3D pose estimation were based on feature engineering and assumptions about skeletons and joint mobility [48, 42, 20, 18]. The first neural methods with convolutional neural networks (CNN) focused on end-to-end reconstruction [28, 53, 51, 41] by directly estimating 3D poses from RGB images without intermediate supervision.

Two-step pose estimation. A new family of 3D pose estimators builds on top of 2D pose estimators by first predicting 2D joint positions in image space (keypoints) which are subsequently lifted to 3D [21, 34, 41, 52, 4, 16]. These approaches outperform the end-to-end counterparts, since they benefit from intermediate supervision. We follow this approach. Recent work shows that predicting 3D poses is relatively straightforward given ground-truth 2D keypoints, and that the difficulty lies in predicting accurate 2D poses [34]. Early approaches [21, 4] simply perform a k-nearest neighbour search for a predicted set of 2D keypoints over a large set of 2D keypoints for which the 3D pose is available and then output the corresponding 3D pose. Some approaches leverage both image features and 2D ground-truth poses [39, 41, 52]. Alternatively, the 3D pose can be predicted from a given set of 2D keypoints by simply predicting their depth [58]. Some works enforce priors about bone lengths and projection consistency with the 2D ground truth [2].

Video pose estimation. Most previous work operates in a single-frame setting, but recently there have been efforts in exploiting temporal information from video to produce more robust predictions and to be less sensitive to noise. [53] infer 3D poses from the HoG features (histograms of oriented gradients) of spatio-temporal volumes. LSTMs have been used to refine 3D poses predicted from single images [30, 24]. The most successful approaches, however, learn from 2D keypoint trajectories. Our work falls under this category.

Recently, LSTM sequence-to-sequence learning models have been proposed, which encode a sequence of 2D poses from a video into a fixed-size vector that is then decoded into a sequence of 3D poses [16]. However, both the input and output sequences have the same length and a deterministic transformation of 2D poses is a much more natural choice. Our experiments with seq2seq models showed that output poses tend to drift over lengthy sequences. [16] tackles this problem by re-initializing the encoder every 5 frames, at the expense of temporal consistency. There has also been work on RNN approaches which consider priors on body part connectivity [27].

Semi-supervised training. There has been work on multitask networks [3] for joint 2D and 3D pose estimation [36, 33] as well as action recognition [33]. Some works transfer the features learned for 2D pose estimation to the 3D task [35]. Unlabeled multi-view recordings have been used for pre-training representations for 3D pose estimation [45], but these recordings are not readily available in unsupervised settings. Generative adversarial networks (GAN) can discriminate realistic poses from unrealistic ones in a second dataset where only 2D annotations are available [56], thus providing a useful form of regularization. [54] use GANs to learn from unpaired 2D/3D datasets and include a 2D projection consistency term. Similarly, [8] discriminate generated 3D poses after randomly projecting them to 2D. [40] propose a weakly-supervised approach based on ordinal depth annotations which leverages a 2D pose dataset augmented with depth comparisons, e.g. "the left leg is behind the right leg".

3D shape recovery. While this paper and the discussed related work focus on reconstructing accurate 3D poses, a parallel line of research aims at recovering full 3D shapes of people from images [1, 23]. These approaches are typically based on parameterized 3D meshes and give less importance to pose accuracy.

Our work. Compared to [41, 40], we do not use heatmaps and instead describe poses with detected keypoint coordinates. This allows the use of efficient 1D convolution over coordinate time series, instead of 2D convolutions over individual heatmaps (or 3D convolutions over heatmap sequences). Our approach also makes computational complexity independent of keypoint spatial resolution. Our models can reach high accuracy with fewer parameters and allow for faster training and inference. Compared to the single-frame baseline proposed by [34] and the LSTM model by [16], we exploit temporal information by performing 1D convolutions over the time dimension, and we propose several optimizations that result in lower reconstruction error. Unlike [16], we learn a deterministic mapping instead of a seq2seq model. Finally, contrary to most of the two-step models mentioned in this section (which use the popular stacked hourglass network [38] for 2D keypoint detection), we show that Mask R-CNN [12] and cascaded pyramid network (CPN) [5] detections are more robust for 3D human pose estimation.
Figure 2: An instantiation of our fully-convolutional 3D pose estimation architecture. The input consists of 2D keypoints for a receptive field of 243 frames (B = 4 blocks) with J = 17 joints. Convolutional layers are in green, where 2J, 3d1, 1024 denotes 2·J input channels, kernels of size 3 with dilation 1, and 1024 output channels. We also show tensor sizes in parentheses for a sample 1-frame prediction, where (243, 34) denotes 243 frames and 34 channels. Due to valid convolutions, we slice the residuals (left and right, symmetrically) to match the shape of subsequent tensors.

3. Temporal dilated convolutional model

Our model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses as input and transforms them through temporal convolutions. Convolutional models enable parallelization over both the batch and the time dimension while RNNs cannot be parallelized over time. In convolutional models, the path of the gradient between output and input has a fixed length regardless of the sequence length, which mitigates the vanishing and exploding gradients which affect RNNs. A convolutional architecture also offers precise control over the temporal receptive field, which we found beneficial to model temporal dependencies for the task of 3D pose estimation. Moreover, we employ dilated convolutions [15] to model long-term dependencies while at the same time maintaining efficiency. Architectures with dilated convolutions have been successful for audio generation [55], semantic segmentation [57] and machine translation [22].

The input layer takes the concatenated (x, y) coordinates of the J joints for each frame and applies a temporal convolution with kernel size W and C output channels. This is followed by B ResNet-style blocks which are surrounded by a skip-connection [13]. Each block first performs a 1D convolution with kernel size W and dilation factor D = W^B, followed by a convolution with kernel size 1. Convolutions (except the very last layer) are followed by batch normalization [17], rectified linear units [37], and dropout [49]. Each block increases the receptive field exponentially by a factor of W, while the number of parameters increases only linearly. The filter hyperparameters W and D are set so that the receptive field for any output frame forms a tree that covers all input frames (see Figure 1). Finally, the last layer outputs a prediction of the 3D poses for all frames in the input sequence using both past and future data to exploit temporal information. To evaluate real-time scenarios, we also experiment with causal convolutions, i.e. convolutions that only have access to past frames. Appendix A.1 illustrates dilated convolutions and causal convolutions.

Convolutional image models typically apply zero-padding to obtain as many outputs as inputs. Early experiments however showed better results when performing only unpadded convolutions while padding the input sequence with replicas of the boundary frames to the left and the right (see Appendix A.5, Figure 9a for an illustration).

Figure 2 shows an instantiation of our architecture for a receptive field size of 243 frames with B = 4 blocks. For convolutional layers, we set W = 3 with C = 1024 output channels and we use a dropout rate p = 0.25.
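To make the block structure above concrete, the following is a minimal PyTorch sketch of a residual block with a dilated valid convolution followed by a 1x1 convolution, and a small model built from such blocks. It follows the description in this section (W = 3, C = 1024, dropout 0.25, valid convolutions with symmetrically sliced residuals), but it is an illustrative reconstruction rather than the released reference implementation; the class and variable names are ours, and causal convolutions and the optimized single-frame training mode are omitted.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Residual block: dilated valid convolution (kernel W) + 1x1 convolution,
    each followed by BatchNorm, ReLU and dropout, as described in Section 3."""
    def __init__(self, channels=1024, kernel=3, dilation=1, dropout=0.25):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel, dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, 1, bias=False)
        self.bn2 = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)
        self.relu = nn.ReLU(inplace=True)
        self.crop = (kernel - 1) * dilation // 2  # frames lost on each side (valid convolution)

    def forward(self, x):
        res = x[:, :, self.crop:x.shape[2] - self.crop]  # slice residual to match output length
        y = self.drop(self.relu(self.bn1(self.conv1(x))))
        y = self.drop(self.relu(self.bn2(self.conv2(y))))
        return res + y

class TemporalModel(nn.Module):
    """2D keypoints (N, T, J, 2) -> 3D poses for the valid central frames (N, T', J, 3)."""
    def __init__(self, num_joints=17, channels=1024, blocks=2, kernel=3):
        super().__init__()
        self.expand = nn.Sequential(
            nn.Conv1d(num_joints * 2, channels, kernel, bias=False),
            nn.BatchNorm1d(channels), nn.ReLU(inplace=True), nn.Dropout(0.25))
        self.blocks = nn.Sequential(*[
            TemporalBlock(channels, kernel, dilation=kernel ** (b + 1)) for b in range(blocks)])
        self.head = nn.Conv1d(channels, num_joints * 3, 1)  # last layer: no BN/ReLU/dropout

    def forward(self, x):
        n, t, j, _ = x.shape
        x = x.view(n, t, j * 2).permute(0, 2, 1)        # (N, 2J, T)
        x = self.head(self.blocks(self.expand(x)))      # (N, 3J, T')
        return x.permute(0, 2, 1).reshape(n, -1, j, 3)

# With blocks=2 the receptive field is 3 + 2*(3 + 9) = 27 frames;
# blocks=4 gives the 243-frame model of Figure 2.
model = TemporalModel(blocks=2)
out = model(torch.randn(8, 27, 17, 2))   # -> torch.Size([8, 1, 17, 3])
```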
4. Semi-supervised approach

We introduce a semi-supervised training method to improve accuracy in settings where the availability of labeled 3D ground-truth pose data is limited. We leverage unlabeled video in combination with an off-the-shelf 2D keypoint detector to extend the supervised loss function with a back-projection loss term. We solve an auto-encoding problem on unlabeled data: the encoder (pose estimator) performs 3D pose estimation from 2D joint coordinates and the decoder (projection layer) projects the 3D pose back to 2D joint coordinates. Training penalizes when the 2D joint coordinates from the decoder are far from the original input.

Figure 3 represents our method, which combines our supervised component with our unsupervised component, which acts as a regularizer. The two objectives are optimized jointly, with the labeled data occupying the first half of a batch, and the unlabeled data occupying the second half. For the labeled data we use the ground-truth 3D poses as target and train a supervised loss. The unlabeled data is used to implement an autoencoder loss where the predicted 3D poses are projected back to 2D and then checked for consistency with the input.

Figure 3: Semi-supervised training with a 3D pose model that takes a sequence of possibly predicted 2D poses as input. We regress the 3D trajectory of the person and add a soft constraint to match the mean bone lengths of the unlabeled predictions to the labeled ones. Everything is trained jointly. WMPJPE stands for "Weighted MPJPE".

Trajectory model. Due to the perspective projection, the 2D pose on the screen depends both on the trajectory (i.e. the global position of the human root joint) and the 3D pose (the position of all joints with respect to the root joint). Without the global position, the subject would always be reprojected at the center of the screen with a fixed scale. We therefore also regress the 3D trajectory of the person, so that the back-projection to 2D can be performed correctly. To this end, we optimize a second network which regresses the global trajectory in camera space. The latter is added to the pose before projecting it back to 2D. The two networks have the same architecture but do not share any weights, as we observed that they affect each other negatively when trained in a multi-task fashion. As it becomes increasingly difficult to regress a precise trajectory if the subject is further away from the camera, we optimize a weighted mean per-joint position error (WMPJPE) loss function for the trajectory:

E = (1 / y_z) ||f(x) − y||    (1)

that is, we weight each sample using the inverse of the ground-truth depth (y_z) in camera space. Regressing a precise trajectory for far subjects is also unnecessary for our purposes, since the corresponding 2D keypoints tend to concentrate around a small area.

Bone length L2 loss. We would like to incentivize the prediction of plausible 3D poses instead of just copying the input. To do so, we found it effective to add a soft constraint to approximately match the mean bone lengths of the subjects in the unlabeled batch to the subjects of the labeled batch ("Bone length L2 loss" in Figure 3). This term plays an important role in self-supervision, as we show in §6.2.

Discussion. Our method only requires the camera intrinsic parameters, which are often available for commercial cameras (even low-end devices typically embed this information in the EXIF metadata of images or videos). The approach is not tied to any specific network architecture and can be applied to any 3D pose detector which takes 2D keypoints as inputs. In our experiments we use the architecture described in §3 to map 2D poses to 3D. To project 3D poses to 2D, we use a simple projection layer which considers linear parameters (focal length, principal point) as well as non-linear lens distortion coefficients (tangential and radial). We found the lens distortions of the cameras used in Human3.6M have negligible impact on the pose estimation metric, but we include these terms nonetheless because they always provide a more accurate modeling of the real camera projection.
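As a concrete illustration of the objective above, the following PyTorch-style sketch spells out the individual terms: supervised MPJPE on poses, the WMPJPE of Eq. (1) on the trajectory, the 2D reprojection (back-projection) error on unlabeled clips, and the bone-length term. The equal weighting of the terms, the depth clamp, and the `parents` skeleton convention are our assumptions for illustration; `reproj_2d` is assumed to be obtained by projecting the predicted pose plus trajectory through the known camera intrinsics.

```python
import torch

def mpjpe(pred, target):
    """Mean per-joint position error: mean Euclidean distance over joints and frames."""
    return torch.mean(torch.norm(pred - target, dim=-1))

def weighted_mpjpe(pred_traj, gt_traj):
    """WMPJPE (Eq. 1): each sample weighted by the inverse ground-truth depth y_z."""
    w = 1.0 / gt_traj[..., 2:3].abs().clamp(min=1e-3)   # clamp is an added safeguard
    return torch.mean(w * torch.norm(pred_traj - gt_traj, dim=-1, keepdim=True))

def bone_length_l2(pred_unlabeled, pred_labeled, parents):
    """Soft constraint: match mean bone lengths of unlabeled and labeled predictions.
    parents[j] is the parent joint of joint j (negative for the root)."""
    def mean_bones(p):  # p: (N, T, J, 3)
        return torch.stack([(p[..., j, :] - p[..., k, :]).norm(dim=-1).mean()
                            for j, k in enumerate(parents) if k >= 0])
    return torch.mean((mean_bones(pred_unlabeled) - mean_bones(pred_labeled)) ** 2)

def semi_supervised_loss(pose_lab, gt_pose, traj_lab, gt_traj,
                         reproj_2d, input_2d, pose_unlab, parents):
    """Joint objective: supervised MPJPE + WMPJPE on the trajectory, plus 2D
    reprojection consistency and the bone-length term on the unlabeled half."""
    return (mpjpe(pose_lab, gt_pose)
            + weighted_mpjpe(traj_lab, gt_traj)
            + mpjpe(reproj_2d, input_2d)
            + bone_length_l2(pose_unlab, pose_lab, parents))
```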


5. Experimental setup

5.1. Datasets and Evaluation

We evaluate on two motion capture datasets, Human3.6M [20, 19] and HumanEva-I [47]. Human3.6M contains 3.6 million video frames for 11 subjects, of which seven are annotated with 3D poses. Each subject performs 15 actions that are recorded using four synchronized cameras at 50 Hz. Following previous work [41, 52, 34, 50, 10, 40, 56, 33], we adopt a 17-joint skeleton, train on five subjects (S1, S5, S6, S7, S8), and test on two subjects (S9 and S11). We train a single model for all actions.

HumanEva-I is a much smaller dataset, with three subjects recorded from three camera views at 60 Hz. Following [34, 16], we evaluate on three actions (Walk, Jog, Box) by training a different model for each action (single action – SA). We also report results when training one model for all actions (multi action – MA), as in [41, 27]. We adopt a 15-joint skeleton and use the provided train/test split.

In our experiments, we consider three evaluation protocols: Protocol 1 is the mean per-joint position error (MPJPE) in millimeters, which is the mean Euclidean distance between predicted joint positions and ground-truth joint positions, and follows [29, 53, 59, 34, 41]. Protocol 2 reports the error after alignment with the ground truth in translation, rotation, and scale (P-MPJPE) [34, 50, 10, 40, 56, 16]. Protocol 3 aligns predicted poses with the ground truth only in scale (N-MPJPE), following [45], for semi-supervised experiments.
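The three protocols can be summarized with the following NumPy sketch of the per-pose error computations (in millimeters). This is a common formulation of these metrics rather than the authors' evaluation code; implementations differ in details such as whether poses are root-centered before alignment.

```python
import numpy as np

def mpjpe(pred, gt):
    """Protocol 1: mean Euclidean distance per joint. pred/gt: (J, 3) arrays."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def p_mpjpe(pred, gt):
    """Protocol 2: MPJPE after rigid alignment (similarity transform) of the
    prediction to the ground truth in translation, rotation and scale."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)   # orthogonal Procrustes via SVD
    if np.linalg.det(u @ vt) < 0:       # avoid reflections
        u[:, -1] *= -1
        s[-1] *= -1
    r = u @ vt
    scale = s.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ r + mu_g, gt)

def n_mpjpe(pred, gt):
    """Protocol 3: align the prediction to the ground truth in scale only."""
    scale = np.sum(gt * pred) / np.sum(pred ** 2)
    return mpjpe(scale * pred, gt)
```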
5.2. Implementation details for 2D pose estimation

Most previous work [34, 58, 52] extracts the subject from ground-truth bounding boxes and then applies the stacked hourglass detector to predict the 2D keypoint locations within the ground-truth bounding box [38]. Our approach (§3 and §4) does not depend on any particular 2D keypoint detector. We therefore investigate several 2D detectors that do not rely on ground-truth boxes, which enables the use of our setup in the wild. In addition to the stacked hourglass detector, we investigate Mask R-CNN [12] with a ResNet-101-FPN [31] backbone, using its reference implementation in Detectron, as well as cascaded pyramid network (CPN) [5], which represents an extension of FPN. The CPN implementation requires bounding boxes to be provided externally (we use Mask R-CNN boxes for this case).

For both Mask R-CNN and CPN, we start with models pre-trained on COCO [32] and fine-tune the detectors on 2D projections of Human3.6M, since the keypoints in COCO differ from Human3.6M [20]. In our ablations, we also experiment with directly applying our 3D pose estimator to pretrained 2D COCO keypoints for estimating the 3D joints of Human3.6M.

For Mask R-CNN, we adopt a ResNet-101 backbone trained with the "stretched 1x" schedule [12] (configuration https://github.com/facebookresearch/Detectron/blob/master/configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_s1x.yaml). When fine-tuning the model on Human3.6M, we reinitialize the last layer of the keypoint network, as well as the deconv layers that regress the heatmaps, to learn a new set of keypoints. We train on 4 GPUs with a step-wise decaying learning rate: 1e-3 for 60k iterations, then 1e-4 for 10k iterations, and 1e-5 for 10k iterations. At inference, we apply a softmax over the heatmaps and extract the expected value of the resulting 2D distribution (soft-argmax). This results in smoother and more precise predictions than hard-argmax [33].
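The soft-argmax readout mentioned above can be sketched as follows. This is a minimal version of the idea; the actual Detectron/CPN post-processing additionally maps heatmap coordinates back to the original image through the detection box, which we omit here.

```python
import torch

def soft_argmax_2d(heatmaps):
    """Expected 2D keypoint location under a softmax over each heatmap.
    heatmaps: (N, J, H, W) raw scores -> (N, J, 2) (x, y) in heatmap coordinates."""
    n, j, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(n, j, -1), dim=-1).reshape(n, j, h, w)
    xs = torch.arange(w, dtype=probs.dtype, device=probs.device)
    ys = torch.arange(h, dtype=probs.dtype, device=probs.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # marginalize rows, expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # marginalize columns, expectation over rows
    return torch.stack([x, y], dim=-1)
```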
For CPN, we use a ResNet-50 backbone with a 384×288 resolution. To fine-tune, we re-initialize the final layers of both GlobalNet and RefineNet (convolution weights and batch normalization statistics). Next, we train on one GPU with batches of 32 images and with a step-wise decaying learning rate: 5e-5 (1/10th of the initial value) for 6k iterations, then 5e-6 for 4k iterations, and finally 5e-7 for 2k iterations. We keep batch normalization enabled while fine-tuning. We train with ground-truth bounding boxes and test using the bounding boxes predicted by the fine-tuned Mask R-CNN model.

5.3. Implementation details for 3D pose estimation

For consistency with other work [34, 29, 53, 59, 41], we train and evaluate on 3D poses in camera space by rotating and translating the ground-truth poses according to the camera transformation, and not using the global trajectory (except for the semi-supervised setting, §4).

As optimizer we use Amsgrad [43] and train for 80 epochs. For Human3.6M, we adopt an exponentially decaying learning rate schedule, starting from η = 0.001 with a shrink factor α = 0.95 applied each epoch.

All temporal models, i.e. models with receptive fields larger than one, are sensitive to the correlation of samples in pose sequences (cf. §3). This results in biased statistics for batch normalization, which assumes independent samples [17]. In preliminary experiments, we found that predicting a large number of adjacent frames during training yields results that are worse than a model exploiting no temporal information (which has well-randomized samples in the batch). We reduce correlation in the training samples by choosing training clips from different video segments. The clip size is set to the width of the receptive field of our architecture so that the model predicts a single 3D pose per training clip. This is important for generalization and we analyze it in detail in Appendix A.5.

We can greatly optimize this single-frame setting by replacing dilated convolutions with strided convolutions where the stride is set to be the dilation factor (see Appendix A.6). This avoids computing states that are never used, and we apply this optimization only during training. At inference, we can process entire sequences and reuse intermediate states of other 3D frames for faster inference. This is possible because our model does not use any form of pooling over the time dimension. To avoid losing frames to valid convolutions, we pad by replication, but only at the input boundaries of a sequence (Appendix A.5, Figure 9a shows an illustration).

We observed that the default hyperparameters of batch normalization lead to large fluctuations of the test error (±1 mm) as well as to fluctuations in the running estimates for inference. To achieve more stable running statistics, we use a schedule for the batch-normalization momentum β: we start from β = 0.1 and decay it exponentially so that it reaches β = 0.001 in the last epoch.

Finally, we perform horizontal flip augmentation at train and test time. We show the effect of this in Appendix A.4.

For HumanEva, we use N = 128, α = 0.996, and train for 1000 epochs using a receptive field of 27 frames. Some frames in HumanEva are corrupted by sensor dropout, and we split the corrupted videos into valid contiguous chunks and treat them as independent videos.
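The optimizer and schedule details for Human3.6M can be summarized in a short sketch. The exact exponential form used for the batch-normalization momentum is our assumption (the text only states that it decays exponentially from 0.1 to 0.001 over the 80 epochs), the tiny model is a placeholder, and the training step itself is elided.

```python
import torch

def lr_at_epoch(epoch, lr0=1e-3, shrink=0.95):
    """Exponentially decaying learning rate: lr = lr0 * shrink**epoch."""
    return lr0 * (shrink ** epoch)

def bn_momentum_at_epoch(epoch, total_epochs=80, start=0.1, end=1e-3):
    """Assumed exponential decay of the BatchNorm momentum from 0.1 to 0.001."""
    return start * (end / start) ** (epoch / max(total_epochs - 1, 1))

model = torch.nn.Sequential(torch.nn.Conv1d(34, 1024, 3), torch.nn.BatchNorm1d(1024))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)  # AMSGrad variant

for epoch in range(80):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    momentum = bn_momentum_at_epoch(epoch)
    for module in model.modules():
        if isinstance(module, torch.nn.BatchNorm1d):
            module.momentum = momentum
    # ... one training epoch over (possibly flip-augmented) 2D keypoint clips ...
```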
6. Results

6.1. Temporal dilated convolutional model

Table 1 shows results for our convolutional model with B = 4 blocks and a receptive field of 243 input frames for both evaluation protocols (§5). The model has a lower average error than all other approaches under both protocols, and does not rely on additional data, unlike many other approaches (+). Under protocol 1 (Table 1a), our model outperforms the previous best result [27] by 6 mm on average, corresponding to an 11% error reduction. Notably, [27] uses ground-truth boxes whereas our model does not.
Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Pavlakos et al. [41] CVPR’17 (∗) 67.4 71.9 66.7 69.1 72.0 77.0 65.0 68.3 83.7 96.5 71.7 65.8 74.9 59.1 63.2 71.9
Tekin et al. [52] ICCV’17 54.2 61.4 60.2 61.2 79.4 78.3 63.1 81.6 70.1 107.3 69.3 70.3 74.3 51.8 63.2 69.7
Martinez et al. [34] ICCV’17 (∗) 51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6 62.3 59.1 65.1 49.5 52.4 62.9
Sun et al. [50] ICCV’17 (+) 52.8 54.8 54.2 54.3 61.8 67.2 53.1 53.6 71.7 86.7 61.5 53.4 61.6 47.1 53.4 59.1
Fang et al. [10] AAAI’18 50.1 54.3 57.0 57.1 66.6 73.3 53.4 55.7 72.8 88.6 60.3 57.7 62.7 47.5 50.6 60.4
Pavlakos et al. [40] CVPR’18 (+) 48.5 54.4 54.4 52.0 59.4 65.3 49.9 52.9 65.8 71.1 56.6 52.9 60.9 44.7 47.8 56.2
Yang et al. [56] CVPR’18 (+) 51.5 58.9 50.4 57.0 62.1 65.4 49.8 52.7 69.2 85.2 57.4 58.4 43.6 60.1 47.7 58.6
Luvizon et al. [33] CVPR’18 (∗)(+) 49.2 51.6 47.6 50.5 51.8 60.3 48.5 51.7 61.5 70.9 53.7 48.9 57.9 44.4 48.9 53.2
Hossain & Little [16] ECCV’18 (†)(∗) 48.4 50.7 57.2 55.2 63.1 72.6 53.0 51.7 66.1 80.9 59.0 57.3 62.4 46.6 49.6 58.3
Lee et al. [27] ECCV’18 (†)(∗) 40.2 49.2 47.8 52.6 50.1 75.0 50.2 43.0 55.8 73.9 54.1 55.6 58.2 43.3 43.3 52.8
Ours, single-frame 47.1 50.6 49.0 51.8 53.6 61.4 49.4 47.4 59.3 67.4 52.4 49.5 55.3 39.5 42.7 51.8
Ours, 243 frames, causal conv. (†) 45.9 48.5 44.3 47.8 51.9 57.8 46.2 45.6 59.9 68.5 50.6 46.4 51.0 34.5 35.4 49.0
Ours, 243 frames, full conv. (†) 45.2 46.7 43.3 45.6 48.1 55.1 44.6 44.3 57.3 65.8 47.1 44.0 49.0 32.8 33.9 46.8
Ours, 243 frames, full conv. (†)(∗) 45.1 47.4 42.0 46.0 49.1 56.7 44.5 44.4 57.2 66.1 47.5 44.8 49.2 32.6 34.0 47.1

(a) Protocol 1: reconstruction error (MPJPE).

Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Martinez et al. [34] ICCV’17 (∗) 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7
Sun et al. [50] ICCV’17 (+) 42.1 44.3 45.0 45.4 51.5 53.0 43.2 41.3 59.3 73.3 51.0 44.0 48.0 38.3 44.8 48.3
Fang et al. [10] AAAI’18 38.2 41.7 43.7 44.9 48.5 55.3 40.2 38.2 54.5 64.4 47.2 44.3 47.3 36.7 41.7 45.7
Pavlakos et al. [40] CVPR’18 (+) 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Yang et al. [56] CVPR’18 (+) 26.9 30.9 36.3 39.9 43.9 47.4 28.8 29.4 36.9 58.4 41.5 30.5 29.5 42.5 32.2 37.7
Hossain & Little [16] ECCV’18 (†)(∗) 35.7 39.3 44.6 43.0 47.2 54.0 38.3 37.5 51.6 61.3 46.5 41.4 47.3 34.2 39.4 44.1
Ours, single-frame 36.0 38.7 38.0 41.7 40.1 45.9 37.1 35.4 46.8 53.4 41.4 36.9 43.1 30.3 34.8 40.0
Ours, 243 frames, causal conv. (†) 35.1 37.7 36.1 38.8 38.5 44.7 35.4 34.7 46.7 53.9 39.6 35.4 39.4 27.3 28.6 38.1
Ours, 243 frames, full conv. (†) 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45.0 52.5 37.4 33.8 37.8 25.6 27.3 36.5
Ours, 243 frames, full conv. (†)(∗) 34.2 36.8 33.9 37.5 37.1 43.2 34.4 33.5 45.3 52.7 37.7 34.1 38.0 25.8 27.7 36.8

(b) Protocol 2: reconstruction error after rigid alignment with the ground truth (P-MPJPE), where available.

Table 1: Reconstruction error on Human3.6M. Legend: (†) uses temporal information. (∗) ground-truth bounding boxes.
(+) extra data – [50, 40, 56, 33] use 2D annotations from the MPII dataset, [40] uses additional data from the Leeds Sports
Pose (LSP) dataset as well as ordinal annotations. [50, 33] evaluate every 64th frame. [16] provided us with corrected results
over the originally published results 3 . Lower is better, best in bold, second best underlined.

The model clearly takes advantage of temporal information, as the error is about 5 mm higher on average for protocol 1 compared to a single-frame baseline where we set the width of all convolution kernels to W = 1. The gap is larger for highly dynamic actions, such as "Walk" (6.7 mm) and "Walk Together" (8.8 mm). The performance of a model with causal convolutions is about half way between the single-frame baseline and our model; causal convolutions enable online processing by predicting the 3D pose for the rightmost input frame. Interestingly, ground-truth bounding boxes result in similar performance to predicted bounding boxes with Mask R-CNN, which suggests that predictions are almost perfect in our single-subject scenario. Figure 4 shows examples of predicted poses including the predicted 2D keypoints, and we include a video illustration in the supplementary material (Appendix A.7) as well as at https://dariopavllo.github.io/VideoPose3D.

Next, we evaluate the impact of the 2D keypoint detector on the final result. Table 3 reports the accuracy of our model with ground-truth 2D poses, hourglass-network predictions from [34] (both pre-trained on MPII and fine-tuned on Human3.6M), Detectron and CPN (both pre-trained on COCO and fine-tuned on Human3.6M). Both Mask R-CNN and CPN give better performance than the stacked hourglass network. The improvement is likely due to the higher heatmap resolution, stronger feature combination (feature pyramid network [31, 44] for Mask R-CNN and RefineNet for CPN), and the more diverse dataset on which they are pretrained, i.e. COCO [32]. When trained on 2D ground-truth poses, our model improves the lower bound of [34] by 8.3 mm, and the LSTM-based approach of Lee et al. [27] by 1.2 mm for protocol 1. Therefore, our improvements are not merely due to a better 2D detector.

³ All subsequent results for [16] in this paper were computed by us using their public implementation.

Absolute position errors do not measure the smoothness of predictions over time, which is important for video. To evaluate this, we measure joint velocity errors (MPJVE), corresponding to the MPJPE of the first derivative of the 3D pose sequences.
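A minimal definition of this velocity metric (our own sketch, not the authors' evaluation code):

```python
import torch

def mpjve(pred, target):
    """Joint velocity error (MPJVE): MPJPE of the first temporal derivative of the
    3D pose sequences. pred/target: (T, J, 3) trajectories in millimeters."""
    pred_vel = pred[1:] - pred[:-1]
    target_vel = target[1:] - target[:-1]
    return torch.mean(torch.norm(pred_vel - target_vel, dim=-1))
```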
Figure 4: Qualitative results for two videos. Top: video frames with 2D pose overlay. Bottom: 3D reconstruction.

Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Single-frame 12.8 12.6 10.3 14.2 10.2 11.3 11.8 11.3 8.2 10.2 10.3 11.3 13.1 13.4 12.9 11.6
Temporal 3.0 3.1 2.2 3.4 2.3 2.7 2.7 3.1 2.1 2.9 2.3 2.4 3.7 3.1 2.8 2.8

Table 2: Velocity error over the 3D poses generated by a convolutional model that considers time and a single-frame baseline.

Method P1 P2
Martinez et al. [34] (GT) 45.5 37.1
Martinez et al. [34] (SH PT) 67.5 52.5
Martinez et al. [34] (SH FT) 62.9 47.7
Hossain & Little [16] (GT) 41.6 31.7
Lee et al. [27] (GT) 38.4 –
Ours (GT) 37.2 27.2
Ours (SH PT from [34]) 58.6 45.0
Ours (SH FT from [34]) 53.4 40.1
Ours (D PT) 54.8 42.0
Ours (D FT) 51.6 40.3
Ours (CPN PT) 52.1 40.1
Ours (CPN FT) 46.8 36.5

Table 3: Effect of the 2D detector on the final result, under Protocol 1 (P1) and Protocol 2 (P2). Legend: ground-truth (GT), stacked hourglass (SH), Detectron (D), cascaded pyramid network (CPN), pre-trained (PT), fine-tuned (FT).

Method Walk (S1 S2 S3) / Jog (S1 S2 S3) / Box (S1 S2 S3)
Pavlakos et al. [41] (MA) 22.3 19.5 29.7 / 28.9 21.9 23.8 / – – –
Martinez et al. [34] (SA) 19.7 17.4 46.8 / 26.9 18.2 18.6 / – – –
Pavlakos et al. [40] (+) (MA) 18.8 12.7 29.2 / 23.5 15.4 14.5 / – – –
Lee et al. [27] (MA) 18.6 19.9 30.5 / 25.7 16.8 17.7 / 42.8 48.1 53.4
Ours (SA) 14.5 10.5 47.3 / 21.9 13.4 13.9 / 24.3 34.9 32.1
Ours (MA) 13.9 10.2 46.6 / 20.9 13.1 13.8 / 23.8 33.7 32.0

Table 4: Error on HumanEva-I under Protocol 2 for single-action (SA) and multi-action (MA) models. Best in bold, second best underlined. (+) uses extra data. The high error on "Walk" of S3 is due to corrupted mocap data.

Model Parameters ≈ FLOPs MPJPE
Hossain & Little [16] 16.96M 33.88M 41.6
Ours 27f w/o dilation 29.53M 59.03M 41.1
Ours 27f 8.56M 17.09M 40.6
Ours 81f 12.75M 25.48M 38.7
Ours 243f 16.95M 33.87M 37.8

Table 5: Computational complexity of various models under Protocol 1, trained on ground-truth 2D poses. Results are without test-time augmentation.

Table 2 shows that our temporal model reduces the MPJVE of the single-frame baseline by 76% on average, resulting in vastly smoother poses.

Table 4 shows results on HumanEva-I and that our model generalizes to smaller datasets; results are based on pre-trained Mask R-CNN 2D detections. Our models outperform the previous state of the art.

Finally, Table 5 compares the convolutional model to the LSTM model of [16] in terms of complexity. We report the number of model parameters and an estimate of the floating-point operations (FLOPs) to predict one frame at inference time (details in Appendix A.2). For the latter, we only consider matrix multiplications and report the amortized cost over a hypothetical sequence of infinite length (to disregard padding). MPJPE results are based on models trained on ground-truth 2D poses without test-time augmentation. Our model achieves a significantly lower error even when the number of computations is halved. Our largest model, with a receptive field of 243 frames, has roughly the same complexity as [16], but at 3.8 mm lower error. The table also highlights the effectiveness of dilated convolutions, which increase complexity only logarithmically with respect to the receptive field.

Since our model is convolutional, it can be parallelized both over the number of sequences as well as over the temporal dimension. This contrasts with RNNs, which can only be parallelized over different sequences and are thus much less efficient for small batch sizes. For inference, we measured about 150k FPS on a single NVIDIA GP100 GPU over a single long sequence, i.e. batch size one, assuming that 2D poses were already available. Speed is largely independent of the batch size due to parallel temporal processing.
Figure 5: Top (a): comparison with [45] on Protocol 3, using a version of the dataset downsampled to 10 FPS for consistency. Middle (b): our method under Protocol 1 (full frame rate). Bottom (c): our method under Protocol 1 when trained on ground-truth 2D poses (full frame rate). The small crosses ("abl." series) denote the ablation of the bone length term.

6.2. Semi-supervised approach

We adopt the setup of [45], who consider various subsets of the Human3.6M training set as labeled data while the remaining samples are used as unlabeled data. Their setup also generally downsamples all data to 10 FPS (from 50 FPS). Labeled subsets are created by first reducing the number of subjects and then by downsampling Subject 1.

Since the dataset is downsampled, we use a receptive field of 9 frames, equivalent to 45 frames upsampled. For the very small subsets, 1% and 5% of S1, we use 3 frames, and we use a single-frame model for 0.1% of S1 where only 49 frames are available. We fine-tuned CPN on the labeled data only and warm up training by iterating only over labeled data for a few epochs (1 epoch for ≥ S1, 20 epochs for smaller subsets).

Figure 5a shows that our semi-supervised approach becomes more effective as the amount of labeled data decreases. For settings with fewer than 5K labeled frames, our approach achieves improvements of about 9-10.4 mm N-MPJPE over our supervised baseline. Our supervised baseline is much stronger than [45] and outperforms all of their results by a large margin. Although [45] uses a single-frame model in all experiments, our findings still hold on 0.1% of S1 (where we also use a single-frame model).

Figure 5b shows results for our method under the more common Protocol 1 for the non-downsampled version of the dataset (50 FPS). This setup is more appropriate for our approach since it allows us to exploit full temporal information in videos. Here we use a receptive field of 27 frames, except for 1% of S1, where we use 9 frames, and 0.1% of S1, where we use one frame. Our semi-supervised approach gains up to 14.7 mm MPJPE over the supervised baseline.

Figure 5c switches the CPN 2D keypoints for ground-truth 2D poses to measure whether we could perform better with a better 2D keypoint detector. In this case, improvements can be up to 22.6 mm MPJPE (1% of S1), which confirms that better 2D detections could improve performance. The same graph shows that the bone length term is crucial for predicting valid poses, since it forces the model to respect kinematic constraints (line "Ours semi-supervised GT abl."). Removing this term drastically decreases the effectiveness of semi-supervised training: for 1% of S1 the error increases from 78.1 mm to 91.3 mm, which compares to 100.7 mm for the supervised baseline.

7. Conclusion

We have introduced a simple fully convolutional model for 3D human pose estimation in video. Our architecture exploits temporal information with dilated convolutions over 2D keypoint trajectories. A second contribution of this work is back-projection, a semi-supervised training method to improve performance when labeled data is scarce. The method works with unlabeled video and only requires intrinsic camera parameters, making it practical in scenarios where motion capture is challenging (e.g. outdoor sports).

Our fully convolutional architecture improves the previous best result on the popular Human3.6M dataset by 6 mm average joint error, which corresponds to a relative reduction of 11%, and also shows improvements on HumanEva-I. Back-projection can improve 3D pose estimation accuracy by about 10 mm N-MPJPE (15 mm MPJPE) over a strong baseline when 5K or fewer annotated frames are available.
References

[1] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In ECCV, 2016.
[2] E. Brau and H. Jiang. 3D human pose estimation via deep learning from 2D annotations. In 3DV, 2016.
[3] R. Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[4] C.-H. Chen and D. Ramanan. 3D human pose estimation = 2D pose estimation + matching. In CVPR, 2017.
[5] Y. Chen, Z. Wang, Y. Peng, and Z. Zhang. Cascaded pyramid network for multi-person pose estimation. In CVPR, 2018.
[6] R. Collobert, C. Puhrsch, and G. Synnaeve. Wav2Letter: an end-to-end ConvNet-based speech recognition system. arXiv preprint arXiv:1609.03193, 2016.
[7] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. In ICML, 2017.
[8] D. Drover, R. M. V, C.-H. Chen, A. Agrawal, A. Tyagi, and C. P. Huynh. Can 3D pose be learned from 2D projections alone? In ECCV Workshops, 2018.
[9] S. Edunov, M. Ott, M. Auli, and D. Grangier. Understanding back-translation at scale. In EMNLP, 2018.
[10] H. Fang, Y. Xu, W. Wang, X. Liu, and S.-C. Zhu. Learning pose grammar to encode human body configuration for 3D pose estimation. In AAAI, 2018.
[11] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In ICML, 2017.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[14] E. Hoffer, R. Banner, I. Golan, and D. Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.
[15] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets, Time-Frequency Methods and Phase Space, 1989.
[16] M. R. I. Hossain and J. J. Little. Exploiting temporal information for 3D pose estimation. In ECCV, 2018.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[18] C. Ionescu, J. Carreira, and C. Sminchisescu. Iterated second-order label sensitive pooling for 3D human pose estimation. In CVPR, 2014.
[19] C. Ionescu, F. Li, and C. Sminchisescu. Latent structured models for human pose estimation. In ICCV, 2011.
[20] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. TPAMI, 2014.
[21] H. Jiang. 3D human pose reconstruction using millions of exemplars. In ICPR, 2010.
[22] N. Kalchbrenner, L. Espeholt, K. Simonyan, A. van den Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.
[23] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
[24] I. Katircioglu, B. Tekin, M. Salzmann, V. Lepetit, and P. Fua. Learning latent representations of 3D human pose with deep neural networks. IJCV, 2018.
[25] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
[26] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato. Unsupervised machine translation using monolingual corpora only. In ICLR, 2018.
[27] K. Lee, I. Lee, and S. Lee. Propagating LSTM: 3D pose estimation based on joint interdependency. In ECCV, 2018.
[28] S. Li and A. B. Chan. 3D human pose estimation from monocular images with deep convolutional neural network. In ACCV, 2014.
[29] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3D human pose estimation. In ICCV, 2015.
[30] M. Lin, L. Lin, X. Liang, K. Wang, and H. Cheng. Recurrent 3D pose sequence machines. In CVPR, 2017.
[31] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[33] D. C. Luvizon, D. Picard, and H. Tabia. 2D/3D pose estimation and action recognition using multitask deep learning. In CVPR, 2018.
[34] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.
[35] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV, 2017.
[36] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36(4), 2017.
[37] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[38] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[39] S. Park, J. Hwang, and N. Kwak. 3D human pose estimation using convolutional neural networks with 2D pose information. In ECCV, 2016.
[40] G. Pavlakos, X. Zhou, and K. Daniilidis. Ordinal depth supervision for 3D human pose estimation. In CVPR, 2018.
[41] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, 2017.
[42] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3D human pose from 2D image landmarks. In ECCV, 2012.
[43] S. J. Reddi, S. Kale, and S. Kumar. On the convergence of Adam and beyond. In ICLR, 2018.
[44] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[45] H. Rhodin, M. Salzmann, and P. Fua. Unsupervised geometry-aware representation for 3D human pose estimation. In ECCV, 2018.
[46] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[47] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. IJCV, 87(1-2), 2010.
[48] C. Sminchisescu. 3D human motion analysis in monocular video: techniques and challenges. In Human Motion, Springer, 2008.
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1), 2014.
[50] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. In ICCV, 2017.
[51] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3D human pose with deep neural networks. In BMVC, 2016.
[52] B. Tekin, P. Marquez Neila, M. Salzmann, and P. Fua. Learning to fuse 2D and 3D image cues for monocular body pose estimation. In ICCV, 2017.
[53] B. Tekin, A. Rozantsev, V. Lepetit, and P. Fua. Direct prediction of 3D body poses from motion compensated sequences. In CVPR, 2016.
[54] H.-Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki. Adversarial inverse graphics networks: Learning 2D-to-3D lifting and image-to-image translation from unpaired supervision. In ICCV, 2017.
[55] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[56] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3D human pose estimation in the wild by adversarial learning. In CVPR, 2018.
[57] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[58] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3D human pose estimation in the wild: a weakly-supervised approach. In CVPR, 2017.
[59] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In CVPR, 2016.
A. Supplementary material

A.1. Dilated convolutions and information flow

Dilated convolutions are a particular form of convolution with a sparse structure, whose kernel points are spaced uniformly and filled with zeros in between. For instance, the discrete filter h = [1 2 3] (where 2 is the center) becomes [1 0 2 0 3] with dilation factor D = 2, and [1 0 0 2 0 0 3] with D = 3. This particular structure enables optimized implementations that skip computations over zero points.

Consider a discrete convolution of two signals f (of length N) and h (zero-centered, of length 2M + 1), which can be computed as

(f ∗ h)[n] = Σ_{m=−M}^{M} f[n − m] h[m]    (2)

Instead of spacing the kernel explicitly and applying a regular convolution, a dilated convolution can be computed as

(f ∗ h_D)[n] = Σ_{m=−M}^{M} f[D(n − m)] h[m]    (3)

yielding roughly the same computational cost as regular convolutions for the same number of non-zero entries, while increasing the receptive field.

For illustration purposes, in Figure 6 we depict the information flow in our models. We also highlight the difference between symmetric convolutions and causal convolutions.

Figure 6: Information flow in our models, from input (bottom layer) to output (top layer). Dashed lines represent skip-connections. (a) Symmetric convolutions. (b) Causal convolutions.
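The kernel-spacing view described at the start of this subsection can be checked directly: a dilated convolution with the compact kernel gives the same result as a regular convolution with the explicitly zero-filled kernel. The small check below is our own illustration (note that PyTorch implements cross-correlation rather than convolution, which does not affect the equivalence):

```python
import torch
import torch.nn.functional as F

# The filter [1, 2, 3] with dilation D = 2 behaves like the zero-filled filter
# [1, 0, 2, 0, 3]: dilating the kernel and spacing it explicitly are equivalent.
x = torch.randn(1, 1, 20)
h = torch.tensor([[[1.0, 2.0, 3.0]]])
h_spaced = torch.tensor([[[1.0, 0.0, 2.0, 0.0, 3.0]]])

y_dilated = F.conv1d(x, h, dilation=2)
y_spaced = F.conv1d(x, h_spaced)
print(torch.allclose(y_dilated, y_spaced))  # True
```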
A.2. Computational cost analysis

In this section we show how we computed the computational complexity of our model and that of [16]. The common practice is to consider only matrix multiplications, as other operations (e.g. biases, batch normalization, activations) have negligible impact on the final complexity. For [16], we evaluated its reference implementation in TensorFlow. We computed the amortized cost to predict one frame using the TensorFlow profiler, and only counted operations corresponding to matrix multiplications. According to TensorFlow's approximation, multiplying an N × M matrix by an M × K matrix has a cost of 2 N M K FLOPs (floating-point operations), which is equivalent to N M K multiply-add operations.

For our model, we adopted the same convention. We provide a sample cost analysis for a model with a receptive field of 27 frames, which consists of 2 residual blocks. Since the matrix multiplications in our model are only due to convolutions, the analysis is straightforward and can be computed by hand.

Figure 7: Architecture of a model with a receptive field of 27 frames, with the corresponding amortized cost to predict one frame in convolutional layers.

As can be seen in Figure 7, the model consists of 6 convolutional layers. Disregarding padding (i.e. for sequences of length N ≫ 0), performing a 1D convolution with C_in input channels, C_out output channels, and width W has a cost of 2 N W C_in C_out FLOPs, i.e. 2 W C_in C_out FLOPs per frame. In our 27-frame model, the results can be summarized as follows:

1. W = 3, channels 17·2 → 1024, cost 0.209 MFLOPs.
2. W = 3, channels 1024 → 1024, cost 6.291 MFLOPs.
3. W = 1, channels 1024 → 1024, cost 2.097 MFLOPs.
4. W = 3, channels 1024 → 1024, cost 6.291 MFLOPs.
5. W = 1, channels 1024 → 1024, cost 2.097 MFLOPs.
6. W = 1, channels 1024 → 17·3, cost 0.104 MFLOPs.

Total: 17.089 MFLOPs per frame.
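The per-frame total above can be reproduced with a few lines; this is our own sketch of the counting convention described in this section, with the layer list taken from Figure 7.

```python
def conv1d_flops_per_frame(w, c_in, c_out):
    """Amortized FLOPs per output frame of a 1D convolution (matrix multiplications
    only, using the 2*N*M*K convention): 2 * W * C_in * C_out."""
    return 2 * w * c_in * c_out

# (kernel width, input channels, output channels) for the 27-frame model of Figure 7.
layers = [(3, 17 * 2, 1024), (3, 1024, 1024), (1, 1024, 1024),
          (3, 1024, 1024), (1, 1024, 1024), (1, 1024, 17 * 3)]

total = sum(conv1d_flops_per_frame(*layer) for layer in layers)
print(total / 1e6)  # ~17.09 MFLOPs per frame, matching the total above up to rounding
```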
A.3. Ablation of receptive field and channel size

In Figure 8b we report the test error for different receptive fields, namely 1, 9, 27, 81, and 243 frames. To this end, we stack a varying number of residual blocks, each of which multiplies the receptive field by 3. In the single-frame scenario, we use 2 blocks and set the convolution widths of all layers to 1, obtaining a model functionally equivalent to [34]. As can be seen, the model does not seem to overfit as the receptive field increases. On the other hand, the error tends to saturate quickly, suggesting that the task of 3D human pose estimation does not require modeling long-term dependencies. Therefore, we generally adopt a receptive field of 243 frames. Similarly, in Figure 8a we vary the channel size between 128 and 2048, with the same findings: the model is not prone to overfitting, but the error saturates past a certain point. Since the computational complexity increases quadratically with respect to the channel size, we adopt C = 1024.

Figure 8: Top (a): error as a function of the channel size, with a fixed receptive field of 27 frames. Bottom (b): error as a function of the receptive field, with a fixed channel size of 1024. Fine-tuned CPN detections for both experiments.

A.4. Data augmentation and convolution type

When we remove test-time augmentation, the error increases to 47.7 mm (from 46.8 mm) in our top-performing model. If we also remove train-time augmentation, the error reaches 49.2 mm (another +1.5 mm).

Next, we replace dilated convolutions with regular dense convolutions. In a model with a receptive field of 27 frames and fine-tuned CPN detections, the error increases from 48.8 mm to 50.4 mm (+1.6 mm), while also increasing the number of parameters and computations by a factor of ≈ 3.5. This highlights that dilated convolutions are crucial for efficiency, and that they counteract overfitting.

A.5. Batching strategy

We argue that the reconstruction error is strongly dependent on how the model is trained, and we suggested generating minibatches in a way that only one output frame at a time is predicted. To show why this is important, we introduce a new hyperparameter, the chunk size C (or step size), which specifies how many frames are predicted at once per sample. Predicting only one frame, i.e. C = 1, requires a full receptive field F as input. Predicting two frames (C = 2) requires F + 1 frames, and so on. It is evident that predicting multiple frames is computationally more efficient, as the results of intermediate convolutions can be shared among frames, and in fact we do this at inference. On the other hand, we show that during training this is detrimental to generalization.

Figure 9b illustrates the reconstruction error (as well as the relative speedup in training time) when training a 27-frame model with different step sizes, namely 1, 2, 4, 8, 16, and 32 frames. Since predicting multiple frames is equivalent to increasing the batch size, thus hurting generalization [25], we make the results comparable by adjusting the batch size so that the model always predicts 1024 frames. Therefore, the 1-frame experiment adopts a batch size of 1024 sequences, which becomes 512 for 2 frames, 256 for 4 frames, and so on. This methodology also ensures that the models will be trained with the same number of weight updates.

The results show that the error decreases in conjunction with the step size, at the expense of training speed. The impaired performance of the models trained with a high step size is caused by correlated batch statistics [14]. Our implementation optimized for single-frame outputs achieves a speed-up factor of ≈ 2, but this gain is even higher on models with larger receptive fields (e.g. ≈ 4 with 81 frames), and enabled us to train the model with 243 frames.

Figure 9: Top: batch creation process for training. This example shows a video of 7 frames which is used to train a model with a receptive field of 5 frames. We generate a training example for each of the 7 frames, such that only the center frame is predicted. The video is padded by replicating boundary frames. Bottom: reconstruction error and training speed-up with different step sizes. The speed-up is relative to C = 1. The 1f variant shows the speed-up corresponding to the implementation optimized for single-frame predictions.

A.6. Optimized training implementation

Figure 10 shows why our implementation tailored for single-frame predictions is important. A regular implementation computes intermediate states layer by layer. This is very efficient for long sequences, as the states computed in layer n can be reused by layer n + 1 without recomputing them. However, for short sequences, this approach becomes inefficient because states near boundaries are not used. In the extreme case of single-frame predictions (which we use for training), many intermediate computations are wasted, as can be seen in Figure 10a. In this case, we replace dilated convolutions with strided convolutions, making sure to obtain a model which is functionally equivalent to the original one (e.g. by also adapting skip-connections). This strategy ensures that no intermediate states will be discarded.

Figure 10: Comparison between two implementations for a single-frame prediction, receptive field of 27 frames. In the layer-by-layer implementation (a), many intermediate states are wasted, whereas the implementation optimized for single-frame predictions (b) computes only the required states. As the length of the sequence increases, the layer-by-layer implementation becomes more efficient.

As mentioned, at inference we use the regular layer-by-layer implementation since it is more efficient for multi-frame predictions.
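The functional equivalence exploited here can be checked with a toy example: a two-layer stack with kernel width 3 (receptive field 9) produces the same single output whether the second layer is dilated or the first layer is strided, as long as the input covers exactly one receptive field. This is our own minimal sketch (plain convolutions without batch normalization or skip-connections), not the released implementation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C, W = 8, 3  # channels, kernel width

# Dilated stack (receptive field 9): used at inference over whole sequences.
d1 = nn.Conv1d(C, C, W, dilation=1, bias=False)
d2 = nn.Conv1d(C, C, W, dilation=W, bias=False)

# Strided stack for single-frame training: same weights, stride instead of dilation.
s1 = nn.Conv1d(C, C, W, stride=W, bias=False)
s2 = nn.Conv1d(C, C, W, stride=1, bias=False)
s1.weight.data.copy_(d1.weight.data)
s2.weight.data.copy_(d2.weight.data)

x = torch.randn(1, C, 9)             # exactly one receptive field of input
y_dilated = d2(d1(x))                # (1, C, 1), computes unused intermediate states
y_strided = s2(s1(x))                # (1, C, 1), computes only the required states
print(torch.allclose(y_dilated, y_strided, atol=1e-6))  # True
```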
A.7. Demo videos
The supplementary material contains several videos highlighting the smoothness of the predictions of our temporal convolutional model compared to the single-frame baseline. Specifically, we show side by side the original video sequence, poses predicted by the single-frame baseline, poses from the temporal convolutional model, as well as the ground-truth poses. Some demo videos can also be found at https://dariopavllo.github.io/VideoPose3D.
