Exploiting Temporal Information for 3D Human Pose Estimation

In this work, we address the problem of 3D human pose estimation from a sequence of 2D human poses. Although the recent success of deep networks has led many state-of-the-art methods for 3D pose estimation to train deep networks end-to-end to predict from images directly, the top-performing approaches have shown the effectiveness of dividing the task of 3D pose estimation into two steps: using a state-of-the-art 2D pose estimator to estimate the 2D pose from images and then mapping them into 3D space. They also showed that a low-dimensional representation like 2D locations of a set of joints can be discriminative enough to estimate 3D pose with high accuracy. However, estimation of 3D pose for individual frames leads to temporally incoherent estimates due to independent error in each frame causing jitter. Therefore, in this work we utilize the temporal information across a sequence of 2D joint locations to estimate a sequence of 3D poses. We designed a sequence-to-sequence network composed of layer-normalized LSTM units with shortcut connections connecting the input to the output on the decoder side and imposed temporal smoothness constraint during training. We found that the knowledge of temporal consistency improves the best reported result on Human3.6M dataset by approximately \(12.2\%\) and helps our network to recover temporally consistent 3D poses over a sequence of images even when the 2D pose detector fails.

[1]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[2]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[3]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[4]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[5]  Wei Zhang,et al.  Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[6]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[7]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[9]  Song-Chun Zhu,et al.  Learning Knowledge-guided Pose Grammar Machine for 3D Human Pose Estimation , 2017, ArXiv.

[10]  Hui Cheng,et al.  Recurrent 3D Pose Sequence Machines , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Camillo J. Taylor,et al.  Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image , 2000, Comput. Vis. Image Underst..

[14]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[15]  Francesc Moreno-Noguer,et al.  A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Wojciech Zaremba,et al.  Recurrent Neural Network Regularization , 2014, ArXiv.

[19]  Vincent Lepetit,et al.  Direct Prediction of 3D Body Poses from Motion Compensated Sequences , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Nojun Kwak,et al.  3D Human Pose Estimation Using Convolutional Neural Networks with 2D Pose Information , 2016, ECCV Workshops.

[21]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Roland Göcke,et al.  Monocular Image 3D Human Pose Estimation under Self-Occlusion , 2013, 2013 IEEE International Conference on Computer Vision.

[23]  Sudeep Sarkar,et al.  Learning Camera Viewpoint Using CNN to Improve 3D Body Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[24]  Ilya Kostrikov,et al.  Depth Sweep Regression Forests for Estimating 3D Human Pose from Images , 2014, BMVC.

[25]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[26]  Pascal Fua,et al.  Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Rama Chellappa,et al.  View independent human body pose estimation from a single perspective image , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[28]  Francesc Moreno-Noguer,et al.  3D Human Pose Estimation from a Single Image via Distance Matrix Regression , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Lourdes Agapito,et al.  Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[31]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Hao Jiang 3D Human Pose Reconstruction Using Millions of Exemplars , 2010, 2010 20th International Conference on Pattern Recognition.

[33]  Song-Chun Zhu,et al.  Monocular 3D Human Pose Estimation by Predicting Depth on Joints , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[34]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[36]  Wen Gao,et al.  Robust Estimation of 3D Human Poses from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Song-Chun Zhu,et al.  Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation , 2017, AAAI.

[38]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[39]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[40]  Geoffrey E. Hinton,et al.  Layer Normalization , 2016, ArXiv.

[41]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[42]  Hsi-Jian Lee,et al.  Determination of 3D human body postures from a single view , 1985, Comput. Vis. Graph. Image Process..

[43]  Juergen Gall,et al.  A Dual-Source Approach for 3D Pose Estimation from a Single Image , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[45]  Ankur Agarwal,et al.  3D human pose from silhouettes by relevance vector regression , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[46]  Jitendra Malik,et al.  Recovering 3D human body configurations using shape contexts , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[48]  Mohan S. Kankanhalli,et al.  Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[49]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[50]  Cristian Sminchisescu,et al.  Fast algorithms for large scale conditional 3D prediction , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[51]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[52]  James J. Little,et al.  3D Pose from Motion for Cross-View Action Recognition via Non-linear Circulant Temporal Encoding , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Ioannis A. Kakadiaris,et al.  Estimating Anthropometry and Pose from a Single Uncalibrated Image , 2001, Comput. Vis. Image Underst..

[54]  Deva Ramanan,et al.  3D Human Pose Estimation = 2D Pose Estimation + Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[56]  Christian Theobalt,et al.  Monocular 3D Human Pose Estimation Using Transfer Learning and Improved CNN Supervision , 2016, ArXiv.

[57]  Bernt Schiele,et al.  Monocular 3D pose estimation and tracking by detection , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.