Ordinal Depth Supervision for 3D Human Pose Estimation

Our ability to train end-to-end systems for 3D human pose estimation from single images is currently constrained by the limited availability of 3D annotations for natural images. Most datasets are captured using Motion Capture (MoCap) systems in a studio setting and it is difficult to reach the variability of 2D human pose datasets, like MPII or LSP. To alleviate the need for accurate 3D ground truth, we propose to use a weaker supervision signal provided by the ordinal depths of human joints. This information can be acquired by human annotators for a wide range of images and poses. We showcase the effectiveness and flexibility of training Convolutional Networks (ConvNets) with these ordinal relations in different settings, always achieving competitive performance with ConvNets trained with accurate 3D joint coordinates. Additionally, to demonstrate the potential of the approach, we augment the popular LSP and MPII datasets with ordinal depth annotations. This extension allows us to present quantitative and qualitative evaluation in non-studio conditions. Simultaneously, these ordinal annotations can be easily incorporated in the training procedure of typical ConvNets for 3D human pose. Through this inclusion we achieve new state-of-the-art performance for the relevant benchmarks and validate the effectiveness of ordinal depth supervision for 3D human pose.

[1]  Zhenhua Wang,et al.  Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[2]  Leonidas J. Guibas,et al.  ObjectNet3D: A Large Scale Database for 3D Object Recognition , 2016, ECCV.

[3]  Ehsan Jahangiri,et al.  Generating Multiple Hypotheses for Human 3D Pose Consistent with 2D Joint Detections , 2017, ArXiv.

[4]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[5]  Alexei A. Efros,et al.  Learning Data-Driven Reflectance Priors for Intrinsic Image Decomposition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[6]  Cristian Sminchisescu,et al.  Pictorial Human Spaces: A Computational Study on the Human Perception of 3D Articulated Poses , 2016, International Journal of Computer Vision.

[7]  Bernt Schiele,et al.  Articulated Multi-person Tracking in the Wild , 2016, ArXiv.

[8]  Gregory N. Hullender,et al.  Learning to rank using gradient descent , 2005, ICML.

[9]  Hossein Azizpour,et al.  Multi-view Body Part Recognition with Random Forests , 2013, BMVC.

[10]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ehsan Jahangiri,et al.  Generating Multiple Diverse Hypotheses for Human 3D Pose Consistent with 2D Joint Detections , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[12]  Camillo J. Taylor,et al.  Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image , 2000, Comput. Vis. Image Underst..

[13]  William T. Freeman,et al.  Learning Ordinal Relationships for Mid-Level Vision , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Peter V. Gehler,et al.  DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Stella X. Yu,et al.  Learning lightness from human judgement on relative reflectance , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[19]  Song-Chun Zhu,et al.  Monocular 3D Human Pose Estimation by Predicting Depth on Joints , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Noah Snavely,et al.  Intrinsic images in the wild , 2014, ACM Trans. Graph..

[23]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[24]  Xiaogang Wang,et al.  Learning Feature Pyramids for Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Xiaowei Zhou,et al.  3D Shape Reconstruction from 2D Landmarks: A Convex Formulation , 2014, ArXiv.

[26]  Mohan S. Kankanhalli,et al.  Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[27]  Weifeng Chen,et al.  Single-Image Depth Perception in the Wild , 2016, NIPS.

[28]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[29]  Zhiao Huang,et al.  Associative Embedding: End-to-End Learning for Joint Detection and Grouping , 2016, NIPS.

[30]  Xiaowei Zhou,et al.  MonoCap: Monocular Human Motion Capture using a CNN Coupled with a Geometric Prior , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33]  Silvio Savarese,et al.  Beyond PASCAL: A benchmark for 3D object detection in the wild , 2014, IEEE Winter Conference on Applications of Computer Vision.

[34]  Andrew Zisserman,et al.  Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[35]  Tie-Yan Liu,et al.  Learning to rank: from pairwise approach to listwise approach , 2007, ICML '07.

[36]  Wei Zhang,et al.  Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[37]  Bodo Rosenhahn,et al.  Posebits for Monocular Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[39]  Jitendra Malik,et al.  Poselets: Body part detectors trained using 3D human pose annotations , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[40]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Roland Göcke,et al.  Monocular Image 3D Human Pose Estimation under Self-Occlusion , 2013, 2013 IEEE International Conference on Computer Vision.

[42]  Stephen E. Robertson,et al.  SoftRank: optimizing non-smooth rank metrics , 2008, WSDM '08.

[43]  Yi Yang,et al.  Attention to Scale: Scale-Aware Semantic Image Segmentation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Wen Gao,et al.  Robust Estimation of 3D Human Poses from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Mark Everingham,et al.  Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[46]  Yuandong Tian,et al.  Single Image 3D Interpreter Network , 2016, ECCV.

[47]  Pascal Fua,et al.  Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Ioannis A. Kakadiaris,et al.  3D Human pose estimation: A review of the literature and analysis of covariates , 2016, Comput. Vis. Image Underst..

[51]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[52]  Subhransu Maji,et al.  Action recognition from a distributed representation of pose and appearance , 2011, CVPR 2011.

[53]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[54]  Juergen Gall,et al.  A Dual-Source Approach for 3D Pose Estimation from a Single Image , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[56]  James T Todd,et al.  The visual perception of 3-D shape from multiple cues: Are observers capable of perceiving metric structure? , 2003, Perception & psychophysics.

[57]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[58]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[59]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[60]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[61]  Cristian Sminchisescu,et al.  Deep Multitask Architecture for Integrated 2D and 3D Human Sensing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[63]  Francesc Moreno-Noguer,et al.  A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[64]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[65]  Xiaowei Zhou,et al.  Sparse Representation for 3D Shape Estimation: A Convex Relaxation Approach , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[66]  Alan L. Yuille,et al.  Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net , 2015, ECCV.

[67]  Vincent Lepetit,et al.  Direct Prediction of 3D Body Poses from Motion Compensated Sequences , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Ilya Kostrikov,et al.  Depth Sweep Regression Forests for Estimating 3D Human Pose from Images , 2014, BMVC.

[69]  Cordelia Schmid,et al.  LCR-Net: Localization-Classification-Regression for Human Pose , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Deva Ramanan,et al.  3D Human Pose Estimation = 2D Pose Estimation + Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[71]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.