Paper Information - Structured Prediction of 3D Human Pose with Deep Neural Networks

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to regress directly from image to 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which incurs a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images. It relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and to account for joint dependencies. We demonstrate that our approach outperforms state-of-the-art methods in terms of both structure preservation and prediction accuracy.
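
To make the idea of an overcomplete autoencoder concrete, the following is a minimal NumPy sketch, not the authors' implementation. It encodes a flattened 3D pose vector into a latent vector that has more dimensions than the pose itself, then decodes it back and trains on reconstruction error. The joint count, latent size, single ReLU layer, initialization scale, learning rate, and all function names are assumptions chosen for illustration; the abstract specifies none of them.

```python
import numpy as np

NUM_JOINTS = 17              # assumption: illustrative joint count
POSE_DIM = 3 * NUM_JOINTS    # flattened (x, y, z) per joint
LATENT_DIM = 200             # assumption: any value > POSE_DIM is "overcomplete"


def init_autoencoder(pose_dim, latent_dim, rng, scale=0.05):
    """Create weights for a one-hidden-layer autoencoder (hypothetical layout)."""
    return {
        "W_enc": rng.normal(0.0, scale, (pose_dim, latent_dim)),
        "b_enc": np.zeros(latent_dim),
        "W_dec": rng.normal(0.0, scale, (latent_dim, pose_dim)),
        "b_dec": np.zeros(pose_dim),
    }


def encode(params, poses):
    """Map poses (n, POSE_DIM) to the higher-dimensional latent space with a ReLU."""
    return np.maximum(0.0, poses @ params["W_enc"] + params["b_enc"])


def decode(params, latent):
    """Map latent vectors (n, LATENT_DIM) back to pose space with a linear layer."""
    return latent @ params["W_dec"] + params["b_dec"]


def reconstruction_loss(params, poses):
    """Mean over samples of the squared reconstruction error."""
    recon = decode(params, encode(params, poses))
    return float(np.mean(np.sum((recon - poses) ** 2, axis=1)))


def train_step(params, poses, lr=1e-3):
    """One step of plain gradient descent on the reconstruction loss."""
    n = poses.shape[0]
    pre = poses @ params["W_enc"] + params["b_enc"]
    hidden = np.maximum(0.0, pre)
    recon = hidden @ params["W_dec"] + params["b_dec"]

    grad_recon = 2.0 * (recon - poses) / n       # d(loss) / d(recon)
    grad_hidden = grad_recon @ params["W_dec"].T
    grad_pre = grad_hidden * (pre > 0)           # ReLU derivative

    grads = {
        "W_enc": poses.T @ grad_pre,
        "b_enc": grad_pre.sum(axis=0),
        "W_dec": hidden.T @ grad_recon,
        "b_dec": grad_recon.sum(axis=0),
    }
    return {name: params[name] - lr * grads[name] for name in params}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    poses = rng.normal(size=(64, POSE_DIM))      # random stand-in for real pose data
    params = init_autoencoder(POSE_DIM, LATENT_DIM, rng)
    print("latent shape:", encode(params, poses).shape)
    print("loss before:", reconstruction_loss(params, poses))
    for _ in range(200):
        params = train_step(params, poses)
    print("loss after:", reconstruction_loss(params, poses))
```

Because the latent space is larger than the input, such an autoencoder can trivially copy its input unless it is regularized in some way. This sketch omits any regularizer for brevity, so it only illustrates the shapes and the reconstruction objective.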
