Holistic Planimetric prediction to Local Volumetric prediction for 3D Human Pose Estimation

We propose a novel approach to 3D human pose estimation from a single depth map. Recently, convolutional neural networks (CNNs) have become a powerful paradigm in computer vision. Many computer vision tasks have benefited from CNNs; however, the conventional approach of directly regressing 3D body joint locations from an image has not yielded a noticeable improvement in performance. In contrast, we formulate the problem as estimating the per-voxel likelihood of key body joints from a 3D occupancy grid. We argue that learning a mapping from volumetric input to volumetric output with 3D convolution consistently improves accuracy compared with learning a regression from a depth map to 3D joint coordinates. To reduce the computational overhead caused by the volumetric representation and 3D convolution, we propose a two-stage approach: holistic 2D prediction followed by local 3D prediction. In the first stage, the Planimetric Network (P-Net) estimates the per-pixel likelihood of each body joint in the holistic 2D space. In the second stage, the Volumetric Network (V-Net) estimates the per-voxel likelihood of each body joint in the local 3D space around the 2D estimates from the first stage, which effectively reduces the computational cost. Our model outperforms existing methods by a large margin on publicly available datasets.
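The data flow of the two stages can be illustrated with a minimal sketch. It is not the authors' implementation: the networks themselves are omitted, so the P-Net and V-Net outputs are stood in for by plain arrays, and all function names, the grid size, and the voxel spacings (`size`, `px_per_voxel`, `z_per_voxel`) are hypothetical. The grid here is aligned to image pixels for simplicity, which is an assumption and may differ from the occupancy grid used in the paper.

```python
import numpy as np

def peak_2d(heatmap):
    """Stage 1 read-out: most likely pixel (row, col) in a per-pixel likelihood map."""
    r, c = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(r), int(c)

def local_occupancy_grid(depth, center_rc, center_z, size=16,
                         px_per_voxel=1, z_per_voxel=0.05):
    """Binary occupancy grid of shape (size, size, size) centred on a 2D estimate.

    A voxel is marked occupied when the depth value at its pixel falls
    inside that voxel's depth slice. All spacings are illustrative assumptions.
    """
    grid = np.zeros((size, size, size), dtype=np.float32)
    half = size // 2
    r0, c0 = center_rc
    for i in range(size):
        for j in range(size):
            r = r0 + (i - half) * px_per_voxel
            c = c0 + (j - half) * px_per_voxel
            if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
                k = int(round((depth[r, c] - center_z) / z_per_voxel)) + half
                if 0 <= k < size:
                    grid[i, j, k] = 1.0
    return grid

def peak_3d_to_joint(volume, center_rc, center_z,
                     px_per_voxel=1, z_per_voxel=0.05):
    """Stage 2 read-out: map the most likely voxel back to (row, col, depth)."""
    half = volume.shape[0] // 2
    i, j, k = np.unravel_index(np.argmax(volume), volume.shape)
    r = center_rc[0] + (int(i) - half) * px_per_voxel
    c = center_rc[1] + (int(j) - half) * px_per_voxel
    z = center_z + (int(k) - half) * z_per_voxel
    return r, c, z

# Toy usage with synthetic data standing in for real network outputs.
depth = np.full((64, 64), 2.0)            # flat surface 2.0 units from the camera
heatmap = np.zeros((64, 64))
heatmap[20, 30] = 1.0                     # pretend 2D likelihood peak for one joint
center = peak_2d(heatmap)
grid = local_occupancy_grid(depth, center, center_z=2.0)
volume = np.zeros_like(grid)
volume[10, 5, 12] = 1.0                   # pretend per-voxel likelihood peak
joint = peak_3d_to_joint(volume, center, center_z=2.0)
```

The point of the sketch is the cost argument in the abstract: the 3D grid covers only a small neighbourhood of each 2D estimate rather than the whole scene, so the number of voxels processed in the second stage stays fixed at `size**3` per joint.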
