Supervised High-Dimension Endecoder Net: 3D End to End Prediction Network for Mark-less Human Pose Estimation from Single Depth Map

Most existing deep learning-based methods for markerless human pose estimation from a single depth map share a common framework: they take a 2D depth map and directly regress the 3D coordinates of human body joints with 2D convolutional neural networks (CNNs). However, a depth map is intrinsically 3D data. Treating it as a 2D image distorts the shape of the actual object through the projection from 3D to 2D space and compels the network to perform perspective-distortion-invariant estimation. Moreover, directly regressing 3D coordinates from a 2D image is a highly nonlinear mapping, which makes the learning procedure difficult. To overcome these problems, a module called the Supervised Endecoder is proposed to process 3D convolutional data; the module can also be stacked in series to adapt to datasets of different sizes. Based on this module, a network called the Supervised High-Dimension Endecoder Network is designed, which predicts the key points of a markerless human body in 3D space from a single depth map. Experiments show improved prediction accuracy compared with state-of-the-art approaches.
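The claim that a depth map is intrinsically 3D data can be illustrated with a minimal sketch, which is not the network proposed here. It back-projects depth pixels to metric 3D points under an assumed pinhole camera model and bins them into a binary occupancy grid of the kind a 3D convolution could consume. The function names, the intrinsics `fx`, `fy`, `cx`, `cy`, and the grid parameters are hypothetical choices for illustration only.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a depth map (rows of depths in metres, 0 = no reading)
    into 3D points, assuming a pinhole camera (an assumption here)."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth reading
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


def voxelize(points, origin, voxel_size, grid_dim):
    """Bin 3D points into a grid_dim^3 binary occupancy grid.
    origin is the metric position of the grid's minimum corner."""
    grid = [[[0] * grid_dim for _ in range(grid_dim)] for _ in range(grid_dim)]
    for p in points:
        idx = [int((p[i] - origin[i]) // voxel_size) for i in range(3)]
        if all(0 <= k < grid_dim for k in idx):
            grid[idx[0]][idx[1]][idx[2]] = 1
    return grid


# Toy 2x2 depth map with two valid pixels (hypothetical values).
depth = [[0.0, 2.0],
         [1.0, 0.0]]
points = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
grid = voxelize(points, origin=(-2.0, -2.0, 0.0), voxel_size=1.0, grid_dim=4)
```

In the 2D image the two valid pixels are adjacent, yet their back-projected points land in clearly separated voxels. That gap between image-space and metric-space geometry is the distortion a 2D CNN would otherwise have to learn to undo.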
