论文信息 - Learning Latent Representations of 3D Human Pose with Deep Neural Networks

Learning Latent Representations of 3D Human Pose with Deep Neural Networks

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.

[1] Mohan S. Kankanhalli,et al. Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps , 2016, ECCV.

[2] Lourdes Agapito,et al. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Jonathan Tompson,et al. Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Antoni B. Chan,et al. 3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[5] Francesc Moreno-Noguer,et al. A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6] Jitendra Malik,et al. Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7] Vincent Lepetit,et al. Direct Prediction of 3D Body Poses from Motion Compensated Sequences , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9] Ronan Collobert,et al. Recurrent Convolutional Neural Networks for Scene Labeling , 2014, ICML.

[10] Varun Ramakrishna,et al. Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Alan L. Yuille,et al. Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[12] Jia Deng,et al. Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[13] Zhenhua Wang,et al. Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[14] Jitendra Malik,et al. Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Fiora Pirri,et al. Bayesian Image Based 3D Pose Estimation , 2016, ECCV.

[16] Yoshua Bengio,et al. Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[17] Geoffrey E. Hinton,et al. Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[18] Cristian Sminchisescu,et al. Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[19] Pascal Fua,et al. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[20] Lukás Burget,et al. Recurrent Neural Network Based Language Modeling in Meeting Recognition , 2011, INTERSPEECH.

[21] Geoffrey E. Hinton,et al. Visualizing Data using t-SNE , 2008 .

[22] Yoshua Bengio,et al. Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[23] Jonathan Tompson,et al. Learning Human Pose Estimation Features with Convolutional Networks , 2013, ICLR.

[24] Christian Szegedy,et al. DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[25] Geoffrey E. Hinton,et al. Generating Text with Recurrent Neural Networks , 2011, ICML.

[26] Wei Zhang,et al. Deep Kinematic Pose Regression , 2016, ECCV Workshops.

[27] Rama Chellappa,et al. Face Association across Unconstrained Video Frames Using Conditional Random Fields , 2012, ECCV.

[28] Antoni B. Chan,et al. Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29] Mark Everingham,et al. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation , 2010, BMVC.

[30] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[31] Xiaolin Hu,et al. Recurrent convolutional neural network for object recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Cristian Sminchisescu,et al. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33] Cristian Sminchisescu,et al. Latent structured models for human pose estimation , 2011, 2011 International Conference on Computer Vision.

[34] Raquel Urtasun,et al. Implicitly Constrained Gaussian Process Regression for Monocular Non-Rigid Pose Estimation , 2010, NIPS.

[35] Juergen Gall,et al. A Dual-Source Approach for 3D Pose Estimation from a Single Image , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Ankur Agarwal,et al. 3D human pose from silhouettes by relevance vector regression , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[37] Meng Wang,et al. Multimodal Deep Autoencoder for Human Pose Recovery , 2015, IEEE Transactions on Image Processing.

[38] Jason Weston,et al. A general regression technique for learning transductions , 2005, ICML '05.

[39] Peter V. Gehler,et al. Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[40] T. Kanade,et al. Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[41] Vincent Lepetit,et al. Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[42] Pascal Fua,et al. Making Action Recognition Robust to Occlusions and Viewpoint Changes , 2010, ECCV.

[43] Pascal Vincent,et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[44] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[45] Cristian Sminchisescu,et al. Deep Multitask Architecture for Integrated 2D and 3D Human Sensing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Michael J. Black,et al. HumanEva: Synchronized Video and Motion Capture Dataset for Evaluation of Articulated Human Motion , 2006 .

[47] Jonathan Tompson,et al. Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[48] Jürgen Schmidhuber,et al. Bidirectional LSTM Networks for Improved Phoneme Classification and Recognition , 2005, ICANN.

[49] Stefan Carlsson,et al. 3D Pictorial Structures for Multiple View Articulated Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[50] Nojun Kwak,et al. 3D Human Pose Estimation Using Convolutional Neural Networks with 2D Pose Information , 2016, ECCV Workshops.

[51] Michael J. Black,et al. Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Silvio Savarese,et al. Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Yi Yang,et al. Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[55] Xiaowei Zhou,et al. Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56] Peter V. Gehler,et al. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Xiaowei Zhou,et al. Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Navdeep Jaitly,et al. Chained Predictions Using Convolutional Neural Networks , 2016, ECCV.

[59] Max Welling,et al. Auto-Encoding Variational Bayes , 2013, ICLR.

[60] Andrew Zisserman,et al. Flowing ConvNets for Human Pose Estimation in Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[61] Cordelia Schmid,et al. MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[62] Roland Memisevic,et al. Zero-bias autoencoders and the benefits of co-adapting features , 2014, ICLR.

[63] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64] Pascal Vincent,et al. Contractive Auto-Encoders: Explicit Invariance During Feature Extraction , 2011, ICML.

[65] Vincent Lepetit,et al. Hands Deep in Deep Learning for Hand Pose Estimation , 2015, ArXiv.