Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry

Training accurate 3D human pose estimators requires large amount of 3D ground-truth data which is costly to collect. Various weakly or self supervised pose estimation methods have been proposed due to lack of 3D data. Nevertheless, these methods, in addition to 2D ground-truth poses, require either additional supervision in various forms (e.g. unpaired 3D ground truth data, a small subset of labels) or the camera parameters in multiview settings. To address these problems, we present EpipolarPose, a self-supervised learning method for 3D human pose estimation, which does not need any 3D ground-truth data or camera extrinsics. During training, EpipolarPose estimates 2D poses from multi-view images, and then, utilizes epipolar geometry to obtain a 3D pose and camera geometry which are subsequently used to train a 3D pose estimator. We demonstrate the effectiveness of our approach on standard benchmark datasets (i.e. Human3.6M and MPI-INF-3DHP) where we set the new state-of-the-art among weakly/self-supervised methods. Furthermore, we propose a new performance measure Pose Structure Score (PSS) which is a scale invariant, structure aware measure to evaluate the structural plausibility of a pose with respect to its ground truth. Code and pretrained models are available at https://github.com/mkocabas/EpipolarPose

[1]  David Nistér,et al.  An efficient solution to the five-point relative pose problem , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[2]  Alexandru Tupan,et al.  Triangulation , 1997, Comput. Vis. Image Underst..

[3]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[4]  Christoph Schnörr,et al.  A Study of Parts-Based Object Class Detection Using Complete Graphs , 2010, International Journal of Computer Vision.

[5]  Andrew L. Maas Rectifier Nonlinearities Improve Neural Network Acoustic Models , 2013 .

[6]  Bernt Schiele,et al.  Multi-view Pictorial Structures for 3D Human Pose Estimation , 2013, BMVC.

[7]  Stefan Carlsson,et al.  3D Pictorial Structures for Multiple View Articulated Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[9]  Nassir Navab,et al.  3D Pictorial Structures for Multiple Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[13]  Jonathan Tompson,et al.  Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[15]  Joshua B. Tenenbaum,et al.  Deep Convolutional Inverse Graphics Network , 2015, NIPS.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[20]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[21]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Fiora Pirri,et al.  Bayesian Image Based 3D Pose Estimation , 2016, ECCV.

[23]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[24]  B. Schiele,et al.  3D Pictorial Structures Revisited: Multiple Human Pose Estimation. , 2016, IEEE transactions on pattern analysis and machine intelligence.

[25]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Pascal Fua,et al.  Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Edilson de Aguiar,et al.  MARCOnI—ConvNet-Based MARker-Less Motion Capture in Outdoor and Indoor Scenes , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Deva Ramanan,et al.  3D Human Pose Estimation = 2D Pose Estimation + Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[32]  Cordelia Schmid,et al.  LCR-Net: Localization-Classification-Regression for Human Pose , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[36]  Song-Chun Zhu,et al.  Monocular 3D Human Pose Estimation by Predicting Depth on Joints , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[37]  Xiaowei Zhou,et al.  Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Katerina Fragkiadaki,et al.  Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Emre Akbas,et al.  MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network , 2018, ECCV.

[40]  Kai Oliver Arras,et al.  How Robust is 3D Human Pose Estimation to Occlusion? , 2018, ArXiv.

[41]  Pascal Fua,et al.  Learning Monocular 3D Human Pose Estimation from Multi-view Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42]  Xiaowei Zhou,et al.  Ordinal Depth Supervision for 3D Human Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43]  Ambrish Tyagi,et al.  Can 3D Pose be Learned from 2D Projections Alone? , 2018, ECCV Workshops.

[44]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[45]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[46]  Jonathan Tompson,et al.  Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning , 2018, NeurIPS.

[47]  Song-Chun Zhu,et al.  Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation , 2017, AAAI.