Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation
Abstract

Direct prediction of 3D body pose and shape remains a challenge even for highly parameterized deep learning models. Mapping from the 2D image space to the prediction space is difficult: perspective ambiguities make the loss function noisy and training data is scarce. In this paper, we propose a novel approach, Neural Body Fitting (NBF). It integrates a statistical body model within a CNN, leveraging reliable bottom-up semantic body part segmentation and robust top-down body model constraints. NBF is fully differentiable and can be trained using 2D and 3D annotations. In detailed experiments, we analyze how the components of our model affect performance, especially the use of part segmentations as an explicit intermediate representation, and present a robust, efficiently trainable framework for 3D human pose estimation from 2D images with competitive results on standard benchmarks. Code will be made available at http://github.com/mohomran/neural_body_fitting

1. Introduction

Much research effort has been successfully directed towards predicting 3D keypoints and stick-figure representations from images of people. Here, we consider the more challenging problem of estimating the parameters of a detailed statistical human body model from a single image.

We tackle this problem by incorporating a model of the human body into a deep learning architecture, which has several advantages. First, the model incorporates limb orientations and shape, which are required for many applications such as character animation, biomechanics and virtual reality. Second, anthropomorphic constraints are automatically satisfied – for example, limb proportions and symmetry. Third, the 3D model output is one step closer to a faithful 3D reconstruction of people in images.

Traditional model-based approaches typically optimize an objective function that measures how well the model fits the image observations – for example, 2D keypoints [6, 24]. These methods do not require paired 3D training data (images with 3D pose), but only work well when initialized close to the solution. By contrast, initialization is not required in forward prediction models, such as CNNs that directly predict 3D keypoints. However, many images with 3D pose annotations are required, which are difficult to obtain, unlike images with 2D pose annotations.

Therefore, like us, a few recent works have proposed hybrid CNN architectures that are trained using model-based loss functions [56, 62, 22, 38]. Specifically, from an image, a CNN predicts the parameters of the SMPL 3D body model [28], and the model is re-projected onto the image to evaluate the loss function in 2D space. Consequently, 2D pose annotations can be used to train such architectures. While these hybrid approaches share similarities, they all differ in essential design choices, such as the amount of 3D vs. 2D annotations used for supervision and the input representation used to lift to 3D.

To analyze the importance of such components, we introduce Neural Body Fitting (NBF), a framework designed to provide fine-grained control over all parts of the body fitting process. NBF is a hybrid architecture that integrates a statistical body model within a CNN. From an RGB image or a semantic segmentation of the image, NBF directly predicts the parameters of the model; those parameters are passed to SMPL to produce a 3D mesh; the joints of the 3D mesh are then projected to the image, closing the loop. Hence, NBF admits both full 3D supervision (in the model or 3D Euclidean space) and weak 2D supervision (if images with only 2D annotations are available). NBF combines the advantages of direct bottom-up methods and top-down methods. It requires neither initialization nor large amounts of 3D training data.

∗This work was done while Christoph Lassner and Peter V. Gehler were with the MPI for Intelligent Systems and the University of Tübingen, and additionally while Peter was with the University of Würzburg.
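To make the reprojection step concrete, the sketch below shows one way to project predicted 3D joints to the image plane and score them against 2D annotations. NBF's precise camera model and loss are defined later in the paper; the weak-perspective parameterization and the function names here are illustrative assumptions.

import numpy as np

def project_weak_perspective(joints_3d, scale, trans_2d):
    # Weak-perspective projection: drop depth, then scale and translate.
    # joints_3d: (J, 3) joint locations, scale: scalar s, trans_2d: (2,) t.
    return scale * joints_3d[:, :2] + trans_2d

def loss_2d(joints_3d, scale, trans_2d, gt_2d):
    # Mean squared 2D distance to ground-truth keypoints; only 2D
    # annotations are needed to evaluate this term.
    pred_2d = project_weak_perspective(joints_3d, scale, trans_2d)
    return np.mean(np.sum((pred_2d - gt_2d) ** 2, axis=-1))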
Figure 1: Given a single 2D image of a person, we predict a semantic body part segmentation. This part segmentation is
represented as a color-coded map and used to predict the parameters of a 3D body model.
[Figure 2 schematic: a proxy CNN produces the part segmentation; an encoding CNN w maps it to pose θ and shape β; the SMPL model M(θ, β) and the projection P(·) produce 2D joint locations.]
Figure 2: Summary of our proposed pipeline. We process the image with a standard semantic segmentation CNN into 12 semantic parts (see Sec. 4.2). An encoding CNN processes the semantic part probability maps to predict SMPL body model parameters (see Sec. 3.2). We then use our SMPL implementation in TensorFlow to obtain a projection of the pose-defining points to 2D. With these points, a loss on 2D vertex positions can be backpropagated through the entire model (see Sec. 3.3).
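The pipeline of Figure 2 can be summarized in code. The sketch below is a minimal TensorFlow rendition under stated assumptions: segmenter and smpl_layer are stand-ins for the part segmentation CNN and a differentiable SMPL implementation, and the encoder architecture is illustrative rather than the one used in the paper.

import tensorflow as tf

NUM_PARTS = 12   # semantic body parts (Sec. 4.2)
NUM_POSE = 72    # SMPL pose parameters (24 joints x 3, axis-angle)
NUM_SHAPE = 10   # SMPL shape parameters

def make_encoding_cnn():
    # CNN mapping part probability maps to SMPL parameters (Sec. 3.2).
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(128, 3, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(NUM_POSE + NUM_SHAPE),
    ])

def nbf_forward(image, segmenter, encoder, smpl_layer, project_fn):
    # image -> part probabilities -> SMPL parameters -> 3D joints -> 2D.
    part_probs = segmenter(image)            # (B, H, W, NUM_PARTS)
    params = encoder(part_probs)             # (B, NUM_POSE + NUM_SHAPE)
    pose, shape = params[:, :NUM_POSE], params[:, NUM_POSE:]
    joints_3d = smpl_layer(pose, shape)      # (B, J, 3), differentiable
    joints_2d = project_fn(joints_3d)        # (B, J, 2), closes the loop
    return pose, shape, joints_3d, joints_2d

Because every step is differentiable, a loss on joints_2d (or on any intermediate output) can be backpropagated through the entire chain.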
Input type                UP-3D   Human3.6M
Segmentation (12 parts)   27.8    33.5
Segmentation (24 parts)   28.8    31.8
Joints (14)               28.8    33.4
Joints (24)               27.7    33.4

Table 1: Input type vs. 3D error in millimeters.
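The input types in Table 1 correspond to different encodings of the same person. The sketch below shows how such encodings could be constructed; the channel layouts are assumptions (in the pipeline itself, the part probability maps come directly from the proxy CNN).

import numpy as np

def parts_to_channels(label_map, num_parts=12):
    # One channel per body part: (H, W) integer labels -> (H, W, num_parts).
    # A dense part segmentation covers the whole body surface.
    return np.eye(num_parts, dtype=np.float32)[label_map]

def joints_to_heatmaps(joints_2d, height, width, sigma=3.0):
    # One Gaussian heatmap per joint: (J, 2) coordinates -> (H, W, J).
    # Joint heatmaps are sparse compared to a part segmentation.
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            for x, y in joints_2d]
    return np.stack(maps, axis=-1).astype(np.float32)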
The difference is especially pronounced on the UP-3D dataset, which contains more visual variety than the images of Human3.6M, with an error drop from 98.5 mm to 27.8 mm when using a 12-part segmentation. This demonstrates that a 2D segmentation of the person into sufficient parts carries a lot of information about 3D pose/shape, while also providing full spatial coverage of the person (compared to joint heatmaps). Is it then worth learning separate mappings, first from image to part segmentation and then from part segmentation to 3D shape/pose? To answer this question, we first need to examine how 3D accuracy is affected by the quality of real, predicted part segmentations.

Figure 3: Segmentation quality (F1-score) vs. fit quality (3D joint error). The darkness indicates the difficulty of the pose, i.e. the distance from the upright pose with arms by the sides.

              Train
Val         VGG     ResNet   RefineNet   GT
VGG         107.2   119.9    135.5       140.7
ResNet      97.1    96.3     112.2       115.6
RefineNet   89.6    89.9     82.0        83.3
GT          62.3    60.5     35.7        27.8

Table 2: Effect of segmentation quality on the quality of the 3D fit prediction modules (errjoints3D, in mm). Columns correspond to the segmentation network used to generate the training input, rows to the segmentation source used at validation time.

Which Input Quality? To determine the effect of segmentation quality on the results, we train three different part segmentation networks. Besides RefineNet, we also train two variants of DeepLab [9], based on VGG-16 [52] and ResNet-101 [14]. These networks result in IoU scores of 67.1, 57.0, and 53.2 respectively on the UP validation set. Given these results, we then train four 3D prediction networks – one for each of the part segmentation networks, and an additional one using the ground-truth segmentations. We report 3D accuracy on the validation set of UP-3D for each of the four 3D networks (diagonal numbers in Table 2). As one would expect, the better the segmentation, the better the 3D prediction accuracy. As can also be seen in Table 2, better segmenters at test time always lead to improved 3D accuracy, even when the 3D prediction networks are trained with poorer segmenters. This is perhaps surprising, and it indicates that mimicking the statistics of a particular segmentation method at training time plays only a minor role (for example, a network trained with GT segmentations and tested using RefineNet segmentations performs comparably to a network that is trained using RefineNet segmentations: 83.3 mm vs. 82.0 mm). To further analyze the correlation between segmentation quality and 3D accuracy, in Figure 3 we plot the relationship between F1-score and 3D reconstruction error. Each dot represents one image, and its color the respective difficulty – we use the distance to the mean pose as a proxy measure for difficulty. The plot clearly shows that the higher the F1-score, the lower the 3D joint error.

Which Types of Supervision? We now examine different combinations of loss terms. The losses we consider are Llat (on the latent model parameters), L3D (on 3D joint/vertex locations), and L2D (on the projected joint/vertex locations). We compare performance using three different error measures: (i) errjoints3D, the Euclidean distance between ground-truth and predicted SMPL joints (in mm); (ii) PCKh [4], the percentage of correct keypoints with the error threshold being 50% of head size, which we measure on a per-example basis; and (iii) errquat, the quaternion distance error of the predicted joint rotations (in radians).

Given sufficient data – the full 3D-annotated UP-3D training set with mirrored examples (11406 examples) – only applying a loss on the model parameters yields reasonable results, and in this setting, additional loss terms do not provide benefits. When training with L3D only, we obtain similar results in terms of errjoints3D; however, interestingly, errquat is significantly higher. This indicates that the predictions produce accurate 3D joint positions in space, but that the limb orientations are incorrect. This further demonstrates that methods trained to produce only 3D keypoints do not capture orientations.
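For reference, the three error measures can be written compactly as below. The PCKh head-size convention follows [4], and errquat uses one standard quaternion distance, theta = 2 arccos(|<q1, q2>|); where the text above leaves details open, the code makes an assumption.

import numpy as np

def err_joints_3d(pred, gt):
    # Mean Euclidean distance between predicted and GT SMPL joints (mm).
    # pred, gt: (J, 3) arrays.
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def pckh(pred_2d, gt_2d, head_size, alpha=0.5):
    # Fraction of keypoints within alpha * head_size of the ground truth.
    dists = np.linalg.norm(pred_2d - gt_2d, axis=-1)
    return np.mean(dists < alpha * head_size)

def err_quat(pred_q, gt_q):
    # Mean angular distance (radians) between per-joint unit quaternions,
    # pred_q, gt_q: (J, 4) arrays; the sign ambiguity is absorbed by abs().
    dots = np.abs(np.sum(pred_q * gt_q, axis=-1))
    return np.mean(2.0 * np.arccos(np.clip(dots, 0.0, 1.0)))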
Loss               errjoints3D   PCKh   errquat
Llat                   83.7      93.1    0.278
Llat + L3D             82.3      93.4    0.280
Llat + L2D             83.1      93.5    0.278
Llat + L3D + L2D       82.0      93.5    0.279
L3D                    83.7      93.5    1.962
L2D                   198.0      94.0    1.971

Method                    Mean    Median
Ramakrishna et al. [45]   168.4   145.9
Zhou et al. [68]          110.0   98.9
SMPLify [6]               79.9    61.9
Random Forests [24]       93.5    77.6
SMPLify (Dense) [24]      74.5    59.6
Ours                      64.0    49.4
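The loss combinations tabulated above can be expressed as one weighted objective. The sketch below also shows how a per-example mask restricts the parameter and 3D terms to the subset of the data that has 3D labels, as in the weakly supervised setting; the weights, tensor shapes, and masking scheme are illustrative assumptions, not the paper's exact configuration.

import tensorflow as tf

def nbf_loss(pred, gt, has_3d, w_lat=1.0, w_3d=1.0, w_2d=1.0):
    # pred / gt: dicts with "params" (B, P), "joints3d" (B, J, 3),
    # "joints2d" (B, J, 2). has_3d: (B,) float mask, 1.0 where 3D
    # annotations exist (gt entries can be zero-filled elsewhere).
    l_lat = tf.reduce_mean(tf.square(pred["params"] - gt["params"]), axis=-1)
    l_3d = tf.reduce_mean(tf.reduce_sum(
        tf.square(pred["joints3d"] - gt["joints3d"]), axis=-1), axis=-1)
    l_2d = tf.reduce_mean(tf.reduce_sum(
        tf.square(pred["joints2d"] - gt["joints2d"]), axis=-1), axis=-1)
    # Llat and L3D require 3D labels; L2D only needs 2D annotations.
    per_example = has_3d * (w_lat * l_lat + w_3d * l_3d) + w_2d * l_2d
    return tf.reduce_mean(per_example)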
Conclusion

We presented Neural Body Fitting (NBF), an approach that integrates the SMPL body model within deep CNN architectures. We analyze (1) how the 3D model can be integrated into a deep neural network, (2) how loss functions can be combined, and (3) how training can be set up such that it works efficiently with scarce 3D data.

In contrast to existing methods, we use a region-based 2D representation, namely a 12-body-part segmentation, as an intermediate step prior to the mapping to 3D shape and pose. This segmentation provides full spatial coverage of a person, as opposed to the commonly used sparse set of keypoints, while also retaining enough information about the arrangement of parts to allow for effective lifting to 3D. We used a stack of CNN layers on top of a segmentation model to predict an encoding in the space of 3D model parameters, followed by a TensorFlow implementation of the 3D model and a projection to the image plane. This full integration allows us to finely tune the loss functions and enables end-to-end training. We found a loss that combines 2D as well as 3D information to work best. The flexible implementation allowed us to experiment with the 3D losses applied to only parts of the data, moving towards a weakly supervised training scenario that avoids expensive 3D labeled data. With 3D information for only 20% of our training data, we could reach performance similar to training with full 3D annotations.

We believe that this encouraging result is an important finding for the design of future datasets and the development of 3D prediction methods that do not require expensive 3D annotations for training. Future work will involve extending this to more challenging settings involving multiple, possibly occluded, people.

Acknowledgements  We would like to thank Dingfan Chen for help with training the HMR [22] model.
References

[1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, 2015.
[2] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In International Conference on 3D Vision (3DV), 2018.
[3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3d people models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[4] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
[5] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: Shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.
[6] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision (ECCV), 2016.
[7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[8] C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[10] M. Dantone, J. Gall, C. Leistner, and L. Van Gool. Body parts dependent joint regressors for human pose estimation in still images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2131–2143, Nov. 2014.
[11] D. M. Gavrila and L. S. Davis. 3-d model-based tracking of humans in action: A multi-view approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 73–80, 1996.
[12] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. arXiv preprint arXiv:1802.00434, 2018.
[13] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and body shape. In Computer Graphics Forum, volume 28, pages 337–346, 2009.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[15] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: Articulated multi-person tracking in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[16] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision (ECCV), 2016.
[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 36(7):1325–1339, 2014.
[18] D. Jack, F. Maire, A. Eriksson, and S. Shirazi. Adversarially parameterized optimization for 3d human pose estimation. 2017.
[19] E. Jahangiri and A. L. Yuille. Generating multiple diverse hypotheses for human 3d pose consistent with 2d joint detections. In IEEE International Conference on Computer Vision (ICCV) Workshops (PeopleCap), 2017.
[20] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), 2010. doi:10.5244/C.24.12.
[21] H. Joo, T. Simon, and Y. Sheikh. Total capture: A 3d deformation model for tracking faces, hands, and bodies. arXiv preprint arXiv:1801.01615, 2018.
[22] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[23] C. Lassner, G. Pons-Moll, and P. V. Gehler. A generative model of people in clothing. In IEEE International Conference on Computer Vision (ICCV), 2017.
[24] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[25] S. Li and A. B. Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision (ACCV), pages 332–347, 2014.
[26] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. In IEEE International Conference on Computer Vision (ICCV), pages 2848–2856, 2015.
[27] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[28] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
[29] M. M. Loper, N. Mahmood, and M. J. Black. MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 33(6):220:1–220:13, Nov. 2014.
[30] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[31] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In International Conference on 3D Vision (3DV), 2017.
[32] D. Mehta, O. Sotnychenko, F. Mueller, W. Xu, S. Sridhar, G. Pons-Moll, and C. Theobalt. Single-shot multi-person 3d pose estimation from monocular rgb. In International Conference on 3D Vision (3DV), 2018.
[33] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics, 36(4), July 2017.
[34] D. Metaxas and D. Terzopoulos. Shape and nonrigid motion estimation through physics-based synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 15(6):580–591, 1993.
[35] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] G. Pavlakos, X. Zhou, and K. Daniilidis. Ordinal depth supervision for 3d human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[37] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[38] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[39] R. Plankers and P. Fua. Articulated soft objects for video-based body modeling. In IEEE International Conference on Computer Vision (ICCV), pages 394–401, 2001.
[40] G. Pons-Moll, D. J. Fleet, and B. Rosenhahn. Posebits for monocular human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2337–2344, 2014.
[41] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34:120, 2015.
[42] G. Pons-Moll and B. Rosenhahn. Model-Based Pose Estimation, chapter 9, pages 139–170. Springer, 2011.
[43] G. Pons-Moll, J. Taylor, J. Shotton, A. Hertzmann, and A. Fitzgibbon. Metric regression forests for correspondence estimation. International Journal of Computer Vision (IJCV), 113(3):163–175, 2015.
[44] A.-I. Popa, M. Zanfir, and C. Sminchisescu. Deep multitask architecture for integrated 2d and 3d human sensing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[45] V. Ramakrishna, T. Kanade, and Y. Sheikh. Reconstructing 3d human pose from 2d image landmarks. In European Conference on Computer Vision (ECCV), pages 573–586. Springer, 2012.
[46] G. Rogez and C. Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. In Advances in Neural Information Processing Systems (NIPS), pages 3108–3116, 2016.
[47] G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net++: Multi-person 2d and 3d pose detection in natural images. arXiv preprint arXiv:1803.00455, 2018.
[48] N. Sarafianos, B. Boteanu, B. Ionescu, and I. A. Kakadiaris. 3d human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding, 152:1–20, 2016.
[49] L. Sigal, A. O. Balan, and M. J. Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision (IJCV), 87(1-2):4–27, 2010.
[50] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages I–421, 2004.
[51] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer. A joint model for 2d and 3d pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3634–3641, 2013.
[52] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[53] C. Sminchisescu and B. Triggs. Kinematic jump processes for monocular 3d human tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, 2003.
[54] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In IEEE International Conference on Computer Vision (ICCV), pages 951–958, 2011.
[55] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. arXiv preprint arXiv:1704.00159, 2017.
[56] J. K. V. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3d human body shape and pose prediction. In British Machine Vision Conference (BMVC), 2017.
[57] C. J. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 677–684, 2000.
[58] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 103–110, 2012.
[59] B. Tekin, I. Katircioglu, M. Salzmann, V. Lepetit, and P. Fua. Structured prediction of 3D human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
[60] B. Tekin, P. Márquez-Neila, M. Salzmann, and P. Fua. Learning to fuse 2d and 3d image cues for monocular body pose estimation. In IEEE International Conference on Computer Vision (ICCV), 2017.
[61] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In IEEE International Conference on Computer Vision (ICCV), 2017.
[62] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In Advances in Neural Information Processing Systems (NIPS), pages 5242–5252, 2017.
[63] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3d human body shapes. arXiv preprint arXiv:1804.04875, 2018.
[64] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[65] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[66] H. Yasin, U. Iqbal, B. Krüger, A. Weber, and J. Gall. A dual-source approach for 3D pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[67] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3d human pose estimation in the wild: A weakly-supervised approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 398–407, 2017.
[68] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis. 3D shape estimation from 2D landmarks: A convex relaxation approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4447–4455, 2015.
[69] X. Zhou, M. Zhu, S. Leonardos, K. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[70] X. Zhou, M. Zhu, G. Pavlakos, S. Leonardos, K. G. Derpanis, and K. Daniilidis. MonoCap: Monocular human motion capture using a cnn coupled with a geometric prior. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
[71] S. Zuffi and M. J. Black. The stitched puppet: A graphical model of 3d human shape and pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3537–3546, 2015.
A. Further Qualitative Results

One of our findings is the high correlation between input segmentation quality and output fit quality. We provide some additional qualitative examples that illustrate this correlation. In Fig. 5, we present the four worst examples from the validation set in terms of 3D joint reconstruction error when we use our trained part segmentation network; in Fig. 6, we present the worst examples when the network is trained to predict body model parameters given the ground truth segmentations. This does not correct all estimated 3D bodies, but the remaining errors are noticeably less severe.

B. Training Details

We present examples of paired training examples and ground truth in Fig. 7.

Segmentation Network  We train our own TensorFlow implementation of a RefineNet [4] network (based on ResNet-101) to predict the part segmentations. The images are cropped to 512×512 pixels, and we train for 20 epochs with a batch size of 5 using the Adam [3] optimizer. Learning rate and weight decay are set to 0.00002 and 0.0001 respectively, with polynomial learning rate decay. Data augmentation improved performance considerably, in particular horizontal reflection (which requires re-mapping the labels for left and right limbs), scale augmentation (0.9–1.1 of the original size), as well as rotations (up to 45 degrees). For training the segmentation network on UP-3D, we used the 5703 training images. For Human3.6M, we subsampled the videos, using only every 10th frame from each video, which results in about 32000 frames. Depending on the amount of data, training the segmentation networks takes about 6–12 hours on a Volta V100 machine.

It is critical to train both the segmentation network and the fitting network with strong data augmentation, especially by introducing random jitter and scaling. For the fitting network, such augmentation has to take place prior to training since it affects the SMPL parameters. We also mirror the data, but this requires careful mirroring of both the part labels as well as the SMPL parameters. This involves remapping the parts, as well as inverting the part rotations.

References

[1] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. Robust optimization for deep regression. 2015.
[2] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 52(4):5–21, 1987.
[3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. 2014.
[4] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Figure 6: Worst examples from the validation set in terms of 3D error given perfect segmentations.
Figure 7: Example training images and annotations, illustrating the different annotation types and granularities.
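To make the mirroring step of Sec. B concrete, the sketch below flips a part-label map and adjusts an axis-angle SMPL pose accordingly. The left/right part pairs and the joint permutation are hypothetical placeholders (the exact index mappings are not listed above); the sign flips follow from conjugating each rotation with the reflection, which negates the y and z axis-angle components.

import numpy as np

# Hypothetical left/right part-label pairs for the 12-part segmentation.
FLIP_PAIRS = [(2, 3), (4, 5), (8, 9), (10, 11)]
# Hypothetical permutation swapping left/right SMPL joints (identity here;
# the real mapping exchanges each left joint index with its right twin).
JOINT_PERM = np.arange(24)

def mirror_example(label_map, pose):
    # label_map: (H, W) integer part labels; pose: (24, 3) axis-angle SMPL.
    flipped = label_map[:, ::-1].copy()           # horizontal reflection
    for a, b in FLIP_PAIRS:
        mask_a, mask_b = flipped == a, flipped == b
        flipped[mask_a], flipped[mask_b] = b, a   # remap the part labels
    mirrored_pose = pose[JOINT_PERM].copy()       # swap left/right joints
    mirrored_pose[:, 1:] *= -1.0                  # invert the part rotations
    return flipped, mirrored_pose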