3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
[Figure 2 schematic: an input layer (2J, 3d1, 1024), residual blocks of convolutions each followed by BatchNorm 1D, ReLU and Dropout 0.25, and an output layer (1024, 1d1, 3J); tensor sizes run from (243, 34) at the input to (1, 51) at the output.]
Figure 2: An instantiation of our fully-convolutional 3D pose estimation architecture. The input consists of 2D keypoints for
a receptive field of 243 frames (B = 4 blocks) with J = 17 joints. Convolutional layers are in green where 2J, 3d1,
1024 denotes 2 · J input channels, kernels of size 3 with dilation 1, and 1024 output channels. We also show tensor sizes
in parentheses for a sample 1-frame prediction, where (243, 34) denotes 243 frames and 34 channels. Due to valid
convolutions, we slice the residuals (left and right, symmetrically) to match the shape of subsequent tensors.
instead of a seq2seq model. Finally, contrary to most of the two-step models mentioned in this section (which use the popular stacked hourglass network [38] for 2D keypoint detection), we show that Mask R-CNN [12] and cascaded pyramid network (CPN) [5] detections are more robust for 3D human pose estimation.
3. Temporal dilated convolutional model

Our model is a fully convolutional architecture with residual connections that takes a sequence of 2D poses as input and transforms them through temporal convolutions. Convolutional models enable parallelization over both the batch and the time dimension, while RNNs cannot be parallelized over time. In convolutional models, the path of the gradient between output and input has a fixed length regardless of the sequence length, which mitigates the vanishing and exploding gradients that affect RNNs. A convolutional architecture also offers precise control over the temporal receptive field, which we found beneficial to model temporal dependencies for the task of 3D pose estimation. Moreover, we employ dilated convolutions [15] to model long-term dependencies while at the same time maintaining efficiency. Architectures with dilated convolutions have been successful for audio generation [55], semantic segmentation [57] and machine translation [22].

The input layer takes the concatenated (x, y) coordinates of the J joints for each frame and applies a temporal convolution with kernel size W and C output channels. This is followed by B ResNet-style blocks which are surrounded by a skip-connection [13]. Each block first performs a 1D convolution with kernel size W and dilation factor D = W^B, followed by a convolution with kernel size 1. Convolutions (except the very last layer) are followed by batch normalization [17], rectified linear units [37], and dropout [49]. Each block increases the receptive field exponentially by a factor of W, while the number of parameters increases only linearly. The filter hyperparameters, W and D, are set so that the receptive field for any output frame forms a tree that covers all input frames (see §1). Finally, the last layer outputs a prediction of the 3D poses for all frames in the input sequence using both past and future data to exploit temporal information. To evaluate real-time scenarios, we also experiment with causal convolutions, i.e. convolutions that only have access to past frames. Appendix A.1 illustrates dilated convolutions and causal convolutions.
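To make the block structure concrete, the following PyTorch sketch implements one residual block (a dilated width-W convolution followed by a 1×1 convolution, each with batch normalization, ReLU and dropout) and stacks B of them behind the input layer. It is a minimal illustration rather than the released implementation: the class names are ours, initialization and batch-normalization momentum settings are omitted, and convolutions are unpadded, so the residual is sliced symmetrically as in Figure 2.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Residual block: dilated conv (width W) then 1x1 conv, each with BN, ReLU, dropout."""
    def __init__(self, channels, W, dilation, dropout=0.25):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, W, dilation=dilation, bias=False)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, 1, bias=False)
        self.bn2 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(dropout)
        self.crop = (W - 1) * dilation // 2      # frames lost on each side (valid convolution)

    def forward(self, x):                        # x: (batch, channels, frames)
        res = x[:, :, self.crop:-self.crop]      # slice the residual symmetrically
        y = self.drop(self.relu(self.bn1(self.conv1(x))))
        y = self.drop(self.relu(self.bn2(self.conv2(y))))
        return res + y

class TemporalModel(nn.Module):
    """Input layer, B residual blocks, and an output layer mapping 2D keypoints to 3D poses."""
    def __init__(self, num_joints=17, B=4, W=3, C=1024, dropout=0.25):
        super().__init__()
        self.expand = nn.Conv1d(2 * num_joints, C, W, bias=False)   # "2J, 3d1, 1024" of Figure 2 (for W = 3)
        self.bn = nn.BatchNorm1d(C)
        self.relu = nn.ReLU(inplace=True)
        self.drop = nn.Dropout(dropout)
        self.blocks = nn.ModuleList(
            [TemporalBlock(C, W, dilation=W ** b, dropout=dropout) for b in range(1, B + 1)]
        )
        self.shrink = nn.Conv1d(C, 3 * num_joints, 1)                # "1024, 1d1, 3J" of Figure 2

    def forward(self, x):                        # x: (batch, 2 * J, frames)
        y = self.drop(self.relu(self.bn(self.expand(x))))
        for block in self.blocks:
            y = block(y)
        return self.shrink(y)                    # (batch, 3 * J, frames - receptive field + 1)
```

With W = 3 and B = 4 this consumes 242 frames of context, so a 243-frame input yields a single 3D pose, matching the tensor sizes reported in Figure 2.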
Convolutional image models typically apply zero-padding to obtain as many outputs as inputs. Early experiments however showed better results when performing only unpadded convolutions while padding the input sequence with replicas of the boundary frames to the left and the right (see Appendix A.5, Figure 9a for an illustration).
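In other words, the padding is applied once to the input sequence rather than inside every convolution. A minimal sketch of this boundary-replication step, assuming a (batch, channels, frames) layout; the helper name and the use of torch.nn.functional.pad are our choices, and placing all padding on the left is only one way to realize the causal variant:

```python
import torch
import torch.nn.functional as F

def pad_boundaries(x, receptive_field, causal=False):
    """Pad a (batch, channels, frames) sequence with replicas of the boundary frames,
    so that the unpadded (valid) convolutions still yield one output per input frame."""
    pad = receptive_field - 1                    # total frames consumed by the network
    if causal:
        left, right = pad, 0                     # only past frames are replicated
    else:
        left, right = pad // 2, pad // 2         # symmetric padding
    return F.pad(x, (left, right), mode='replicate')

x = torch.randn(8, 34, 300)                      # 300-frame clips of 2 * 17 keypoint coordinates
assert pad_boundaries(x, receptive_field=243).shape[-1] == 300 + 242
```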
Figure 2 shows an instantiation of our architecture for a receptive field size of 243 frames with B = 4 blocks. For convolutional layers, we set W = 3 with C = 1024 output channels and we use a dropout rate p = 0.25.
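The 243-frame receptive field follows directly from these hyperparameters: the input convolution spans W frames, and the block with dilation W^b adds (W − 1) · W^b frames of context. A quick sanity check (the helper name is ours):

```python
def receptive_field(W: int, B: int) -> int:
    # Input convolution of width W, plus B blocks whose dilated convolutions
    # (dilation W**b) each enlarge the receptive field by (W - 1) * W**b frames.
    return W + sum((W - 1) * W ** b for b in range(1, B + 1))

assert receptive_field(W=3, B=4) == 243          # the instantiation of Figure 2
```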
4. Semi-supervised approach

We introduce a semi-supervised training method to improve accuracy in settings where the availability of labeled 3D ground-truth pose data is limited. We leverage unlabeled video in combination with an off-the-shelf 2D keypoint detector to extend the supervised loss function with a back-projection loss term. We solve an auto-encoding problem on unlabeled data: the encoder (pose estimator) performs 3D pose estimation from 2D joint coordinates and the decoder (projection layer) projects the 3D pose back to 2D joint coordinates. Training penalizes the model when the 2D joint coordinates from the decoder are far from the original input.

Figure 3 represents our method, which combines our supervised component with our unsupervised component acting as a regularizer. The two objectives are optimized jointly, with the labeled data occupying the first half of a batch and the unlabeled data occupying the second half. For the labeled data we use the ground-truth 3D poses as target and train a supervised loss. The unlabeled data is used to implement an autoencoder loss where the predicted 3D poses are projected back to 2D and then checked for consistency with the input.

[Figure 3 diagram: labeled 2D poses → pose model (MPJPE loss against ground-truth 3D poses) and trajectory model (WMPJPE loss against ground-truth global positions); unlabeled 2D poses → pose model and trajectory model → projection → 2D MPJPE loss against the input, with a bone length L2 loss linking the two halves.]
Figure 3: Semi-supervised training with a 3D pose model that takes a sequence of possibly predicted 2D poses as input. We regress the 3D trajectory of the person and add a soft constraint to match the mean bone lengths of the unlabeled predictions to the labeled ones. Everything is trained jointly. WMPJPE stands for “Weighted MPJPE”.
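The sketch below assembles such a batch-level objective: supervised 3D losses on the labeled half, a 2D back-projection loss on the unlabeled half, and a bone-length term linking the two. It is an illustration only; the helpers project_to_2d, bone_lengths and trajectory_loss are assumed interfaces (sketches of a projection layer and of the bone-length computation follow later in this section), and the equal weighting of the terms is our simplification rather than a detail given in the text.

```python
import torch

def mpjpe(pred, target):
    """Mean per-joint position error: mean Euclidean distance over joints and frames."""
    return torch.mean(torch.norm(pred - target, dim=-1))

def semi_supervised_loss(pose_model, traj_model, project_to_2d, trajectory_loss,
                         bone_lengths, kps_2d, poses_3d_gt, traj_gt):
    """Objective for one batch whose first half is labeled and second half is unlabeled."""
    half = kps_2d.shape[0] // 2
    pred_pose = pose_model(kps_2d)               # (batch, frames, joints, 3), root-relative
    pred_traj = traj_model(kps_2d)               # (batch, frames, 1, 3), global root position

    # Labeled half: supervised pose loss plus the trajectory loss
    # (the depth-weighted WMPJPE of Eq. (1) below).
    supervised = (mpjpe(pred_pose[:half], poses_3d_gt)
                  + trajectory_loss(pred_traj[:half], traj_gt))

    # Unlabeled half: add the trajectory, project back to 2D and compare with
    # the input keypoints ("2D MPJPE loss" in Figure 3).
    reprojected = project_to_2d(pred_pose[half:] + pred_traj[half:])
    unsupervised = mpjpe(reprojected, kps_2d[half:])

    # Soft constraint: mean bone lengths of the unlabeled predictions should
    # match those of the labeled half ("Bone length L2 loss" in Figure 3).
    mean_bones = lambda p: bone_lengths(p).mean(dim=0)
    bone_term = torch.sum((mean_bones(pred_pose[half:]) - mean_bones(pred_pose[:half])) ** 2)

    return supervised + unsupervised + bone_term
```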
Trajectory model. Due to the perspective projection, the 2D pose on the screen depends both on the trajectory (i.e. the global position of the human root joint) and the 3D pose (the position of all joints with respect to the root joint). Without the global position, the subject would always be reprojected at the center of the screen with a fixed scale. We therefore also regress the 3D trajectory of the person, so that the back-projection to 2D can be performed correctly. To this end, we optimize a second network which regresses the global trajectory in camera space. The latter is added to the pose before projecting it back to 2D. The two networks have the same architecture but do not share any weights, as we observed that they affect each other negatively when trained in a multi-task fashion. As it becomes increasingly difficult to regress a precise trajectory if the subject is further away from the camera, we optimize a weighted mean per-joint position error (WMPJPE) loss function for the trajectory:

E = \frac{1}{y_z} \lVert f(x) - y \rVert    (1)

that is, we weight each sample using the inverse of the ground-truth depth (y_z) in camera space. Regressing a precise trajectory for far subjects is also unnecessary for our purposes, since the corresponding 2D keypoints tend to concentrate around a small area.
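Equation (1) translates directly into a few lines, assuming trajectories in camera space with the depth as the last coordinate (a sketch; the tensor layout is our assumption):

```python
import torch

def weighted_mpjpe(pred, target):
    """WMPJPE of Eq. (1): Euclidean trajectory error weighted by the inverse
    ground-truth depth y_z. pred, target: (..., 3) with y_z = target[..., 2] > 0."""
    w = 1.0 / target[..., 2]                     # inverse ground-truth depth
    return torch.mean(w * torch.norm(pred - target, dim=-1))
```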
Bone length L2 loss. We would like to incentivize the prediction of plausible 3D poses instead of just copying the input. To do so, we found it effective to add a soft constraint to approximately match the mean bone lengths of the subjects in the unlabeled batch to the subjects of the labeled batch ("Bone length L2 loss" in Figure 3). This term plays
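A sketch of the bone-length computation behind that constraint, assuming the skeleton is given as a parent-index array (the exact 17-joint Human3.6M topology is dataset-specific and not reproduced here):

```python
import torch

def bone_lengths(poses, parents):
    """Per-sample bone lengths (joint-to-parent distances), averaged over frames.

    poses:   (batch, frames, joints, 3) 3D joint positions
    parents: parent index per joint, with parents[j] < 0 for the root
    returns: (batch, num_bones) tensor
    """
    lengths = []
    for j, p in enumerate(parents):
        if p < 0:
            continue                             # the root joint has no bone
        lengths.append(torch.norm(poses[:, :, j] - poses[:, :, p], dim=-1).mean(dim=1))
    return torch.stack(lengths, dim=1)

# The soft constraint then compares the mean bone lengths of the two batch halves:
#   torch.sum((bone_lengths(unlabeled, parents).mean(0)
#              - bone_lengths(labeled, parents).mean(0)) ** 2)
```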
Discussion. Our method only requires the camera intrinsic parameters, which are often available for commercial cameras.¹ The approach is not tied to any specific network architecture and can be applied to any 3D pose detector which takes 2D keypoints as inputs. In our experiments we use the architecture described in §3 to map 2D poses to 3D. To project 3D poses to 2D, we use a simple projection layer which considers linear parameters (focal length, principal point) as well as non-linear lens distortion coefficients (tangential and radial). We found that the lens distortions of the cameras used in Human3.6M have negligible impact on the pose estimation metric, but we include these terms nonetheless because they always provide a more accurate modeling of the real camera projection.

¹ Even low-end devices typically embed this information in the EXIF metadata.
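A minimal sketch of such a projection layer, using the standard pinhole camera with radial (k1, k2, k3) and tangential (p1, p2) distortion coefficients; whether the paper's layer matches this parameterization exactly is not stated in this excerpt:

```python
import torch

def project_to_2d(X, f, c, k, p):
    """Project camera-space 3D points to pixel coordinates.

    X: (..., 3) points with positive depth; f: (2,) focal length; c: (2,) principal
    point; k: (3,) radial coefficients; p: (2,) tangential coefficients.
    """
    x = X[..., 0] / X[..., 2]                    # perspective division
    y = X[..., 1] / X[..., 2]
    r2 = x ** 2 + y ** 2
    radial = 1 + k[0] * r2 + k[1] * r2 ** 2 + k[2] * r2 ** 3
    x_d = x * radial + 2 * p[0] * x * y + p[1] * (r2 + 2 * x ** 2)
    y_d = y * radial + p[0] * (r2 + 2 * y ** 2) + 2 * p[1] * x * y
    return torch.stack((f[0] * x_d + c[0], f[1] * y_d + c[1]), dim=-1)
```

Because every operation is differentiable, gradients from the 2D loss can flow back through this layer into the pose and trajectory networks.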
5. Experimental setup

5.1. Datasets and Evaluation

We evaluate on two motion capture datasets, Human3.6M [20, 19] and HumanEva-I [47]. Human3.6M contains 3.6 million video frames for 11 subjects, of which seven are annotated with 3D poses. Each subject performs 15 actions that are recorded using four synchronized cameras at 50 Hz. Following previous work [41, 52, 34, 50, 10, 40, 56, 33], we adopt a 17-joint skeleton, train on five subjects (S1, S5, S6, S7, S8), and test on two subjects (S9 and S11). We train a single model for all actions.

HumanEva-I is a much smaller dataset, with three subjects recorded from three camera views at 60 Hz. Following [34, 16], we evaluate on three actions (Walk, Jog, Box) by training a different model for each action (single action – SA). We also report results when training one model for all actions (multi action – MA), as in [41, 27]. We adopt a 15-joint skeleton and use the provided train/test split.

In our experiments, we consider three evaluation protocols: Protocol 1 is the mean per-joint position error (MPJPE) in millimeters, i.e. the mean Euclidean distance between predicted and ground-truth joint positions, and follows [29, 53, 59, 34, 41]. Protocol 2 reports the error after alignment with the ground truth in translation, rotation, and scale (P-MPJPE) [34, 50, 10, 40, 56, 16]. Protocol 3 aligns predicted poses with the ground truth only in scale (N-MPJPE), following [45], for semi-supervised experiments.
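For reference, the three protocols can be written compactly for a single pose; the rigid alignment of Protocol 2 uses the standard Procrustes (similarity-transform) solution and the scale-only alignment of Protocol 3 uses the least-squares optimal scale. This is a sketch of the metrics as described above, not the official evaluation code:

```python
import torch

def mpjpe(pred, gt):
    """Protocol 1: mean Euclidean distance between predicted and ground-truth joints."""
    return torch.mean(torch.norm(pred - gt, dim=-1))

def n_mpjpe(pred, gt):
    """Protocol 3: align the prediction to the ground truth in scale only (N-MPJPE)."""
    scale = torch.sum(pred * gt) / torch.sum(pred ** 2)
    return mpjpe(scale * pred, gt)

def p_mpjpe(pred, gt):
    """Protocol 2: rigid alignment (translation, rotation, scale) via Procrustes
    analysis before computing the error. pred, gt: (joints, 3)."""
    mu_p, mu_g = pred.mean(dim=0), gt.mean(dim=0)
    p0, g0 = pred - mu_p, gt - mu_g
    U, S, Vt = torch.linalg.svd(p0.t() @ g0)     # 3x3 cross-covariance
    V = Vt.t()
    d = torch.sign(torch.det(V @ U.t())).item()  # avoid improper rotations (reflections)
    D = torch.diag(torch.tensor([1.0, 1.0, d], dtype=pred.dtype))
    R = V @ D @ U.t()
    scale = torch.trace(torch.diag(S) @ D) / torch.sum(p0 ** 2)
    aligned = scale * (p0 @ R.t()) + mu_g
    return mpjpe(aligned, gt)
```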
5.2. Implementation details for 2D pose estimation

Most previous work [34, 58, 52] extracts the subject from ground-truth bounding boxes and then applies the stacked hourglass detector to predict the 2D keypoint locations within the ground-truth bounding box [38]. Our approach (§3 and §4) does not depend on any particular 2D
Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Martinez et al. [34] ICCV’17 (∗) 39.5 43.2 46.4 47.0 51.0 56.0 41.4 40.6 56.5 69.4 49.2 45.0 49.5 38.0 43.1 47.7
Sun et al. [50] ICCV’17 (+) 42.1 44.3 45.0 45.4 51.5 53.0 43.2 41.3 59.3 73.3 51.0 44.0 48.0 38.3 44.8 48.3
Fang et al. [10] AAAI’18 38.2 41.7 43.7 44.9 48.5 55.3 40.2 38.2 54.5 64.4 47.2 44.3 47.3 36.7 41.7 45.7
Pavlakos et al. [40] CVPR’18 (+) 34.7 39.8 41.8 38.6 42.5 47.5 38.0 36.6 50.7 56.8 42.6 39.6 43.9 32.1 36.5 41.8
Yang et al. [56] CVPR’18 (+) 26.9 30.9 36.3 39.9 43.9 47.4 28.8 29.4 36.9 58.4 41.5 30.5 29.5 42.5 32.2 37.7
Hossain & Little [16] ECCV’18 (†)(∗) 35.7 39.3 44.6 43.0 47.2 54.0 38.3 37.5 51.6 61.3 46.5 41.4 47.3 34.2 39.4 44.1
Ours, single-frame 36.0 38.7 38.0 41.7 40.1 45.9 37.1 35.4 46.8 53.4 41.4 36.9 43.1 30.3 34.8 40.0
Ours, 243 frames, causal conv. (†) 35.1 37.7 36.1 38.8 38.5 44.7 35.4 34.7 46.7 53.9 39.6 35.4 39.4 27.3 28.6 38.1
Ours, 243 frames, full conv. (†) 34.1 36.1 34.4 37.2 36.4 42.2 34.4 33.6 45.0 52.5 37.4 33.8 37.8 25.6 27.3 36.5
Ours, 243 frames, full conv. (†)(∗) 34.2 36.8 33.9 37.5 37.1 43.2 34.4 33.5 45.3 52.7 37.7 34.1 38.0 25.8 27.7 36.8
(b) Protocol 2: reconstruction error after rigid alignment with the ground truth (P-MPJPE), where available.
Table 1: Reconstruction error on Human3.6M. Legend: (†) uses temporal information. (∗) ground-truth bounding boxes.
(+) extra data – [50, 40, 56, 33] use 2D annotations from the MPII dataset, [40] uses additional data from the Leeds Sports
Pose (LSP) dataset as well as ordinal annotations. [50, 33] evaluate every 64th frame. [16] provided us with corrected results over the originally published results.³ Lower is better, best in bold, second best underlined.
mation as the error is about 5 mm higher on average for protocol 1 compared to a single-frame baseline where we set the width of all convolution kernels to W = 1. The gap is larger for highly dynamic actions, such as “Walk” (6.7 mm) and “Walk Together” (8.8 mm). The performance of a model with causal convolutions is about halfway between the single-frame baseline and our model; causal convolutions enable online processing by predicting the 3D pose for the rightmost input frame. Interestingly, ground-truth bounding boxes result in similar performance to predicted bounding boxes with Mask R-CNN, which suggests that predictions are almost perfect in our single-subject scenario. Figure 4 shows examples of predicted poses including the predicted 2D keypoints, and we included a video illustration in the supplementary material (Appendix A.7) as well as at https://dariopavllo.github.io/VideoPose3D.

Next, we evaluate the impact of the 2D keypoint detector on the final result. Table 3 reports the accuracy of our model with ground-truth 2D poses, hourglass-network predictions from [34] (both pre-trained on MPII and fine-tuned on Human3.6M), Detectron and CPN (both pre-trained on COCO and fine-tuned on Human3.6M). Both Mask R-CNN and CPN give better performance than the stacked hourglass network. The improvement is likely due to the higher heatmap resolution, stronger feature combination (feature pyramid network [31, 44] for Mask R-CNN and RefineNet for CPN), and the more diverse dataset on which they are pretrained, i.e. COCO [32]. When trained on 2D ground-truth poses, our model improves the lower bound of [34] by 8.3 mm, and the LSTM-based approach of Lee et al. [27] by 1.2 mm for protocol 1. Therefore, our improvements are not merely due to a better 2D detector.

Absolute position errors do not measure the smoothness of predictions over time, which is important for video. To evaluate this, we measure joint velocity errors (MPJVE), corresponding to the MPJPE of the first derivative of the 3D pose sequences. Table 2 shows that our temporal model

³ All subsequent results for [16] in this paper were computed by us using their public implementation.
Figure 4: Qualitative results for two videos. Top: video frames with 2D pose overlay. Bottom: 3D reconstruction.
Dir. Disc. Eat Greet Phone Photo Pose Purch. Sit SitD. Smoke Wait WalkD. Walk WalkT. Avg
Single-frame 12.8 12.6 10.3 14.2 10.2 11.3 11.8 11.3 8.2 10.2 10.3 11.3 13.1 13.4 12.9 11.6
Temporal 3.0 3.1 2.2 3.4 2.3 2.7 2.7 3.1 2.1 2.9 2.3 2.4 3.7 3.1 2.8 2.8
Table 2: Velocity error (MPJVE) over the 3D poses generated by a convolutional model that considers time and a single-frame baseline.
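MPJVE is simply the MPJPE applied to the first temporal derivative of the pose sequences, e.g. (tensor layout assumed as frames × joints × 3):

```python
import torch

def mpjve(pred, gt):
    """Mean per-joint velocity error: MPJPE of the first derivative of the 3D
    pose sequences. pred, gt: (frames, joints, 3)."""
    velocity_error = (pred[1:] - pred[:-1]) - (gt[1:] - gt[:-1])
    return torch.mean(torch.norm(velocity_error, dim=-1))
```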
Table 3: Effect of the 2D detector on the final result, under Protocol 1 (P1) and Protocol 2 (P2). Legend: ground-truth (GT), stacked hourglass (SH), Detectron (D), cascaded pyramid network (CPN), pre-trained (PT), fine-tuned (FT).

Table 5: Computational complexity of various models under Protocol 1, trained on ground-truth 2D poses. Results are without test-time augmentation.