Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation

This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We also propose an efficient recurrent neural network for performing inference with the learned image-embedding. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.

[1]  Paul H. Calamai,et al.  Projected gradient methods for linearly constrained problems , 1987, Math. Program..

[2]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[3]  Ben Taskar,et al.  MODEC: Multimodal Decomposable Models for Human Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  José A. Rodríguez-Serrano,et al.  Label embedding for text recognition , 2013, BMVC.

[5]  Razvan Pascanu,et al.  Theano: new features and speed improvements , 2012, ArXiv.

[6]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[7]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[8]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Cristian Sminchisescu,et al.  Structural SVM for visual localization and continuous state estimation , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[10]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[11]  Gustavo Carneiro,et al.  Deep structured learning for mass segmentation from mammograms , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[12]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Vittorio Ferrari,et al.  Better Appearance Models for Pictorial Structures , 2009, BMVC.

[14]  Xiang Zhang,et al.  OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks , 2013, ICLR.

[15]  Thorsten Joachims,et al.  Cutting-plane training of structural SVMs , 2009, Machine Learning.

[16]  Yann LeCun,et al.  Synergistic Face Detection and Pose Estimation with Energy-Based Models , 2004, J. Mach. Learn. Res..

[17]  Ian D. Reid,et al.  Articulated Body Motion Capture by Stochastic Search , 2005, International Journal of Computer Vision.

[18]  Yoshua Bengio,et al.  Better Mixing via Deep Representations , 2012, ICML.

[19]  Richard M. Murray,et al.  A Mathematical Introduction to Robotic Manipulation , 1994 .

[20]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[21]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[22]  Antoni B. Chan,et al.  Heterogeneous Multi-task Learning for Human Pose Estimation with Deep Convolutional Neural Network , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[23]  Xiaogang Wang,et al.  Deep Learning Face Representation from Predicting 10,000 Classes , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Nir Friedman,et al.  Probabilistic Graphical Models - Principles and Techniques , 2009 .

[25]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[27]  Xiaogang Wang,et al.  Multi-task Recurrent Neural Network for Immediacy Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[29]  Jonathan Tompson,et al.  Learning Human Pose Estimation Features with Convolutional Networks , 2013, ICLR.

[30]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[31]  Andrew Zisserman,et al.  Deep Structured Output Learning for Unconstrained Text Recognition , 2014, ICLR.

[32]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[33]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[34]  Jitendra Malik,et al.  Twist Based Acquisition and Tracking of Animal and Human Kinematics , 2004, International Journal of Computer Vision.

[35]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[36]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[37]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, ICCV.

[38]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Cristian Sminchisescu,et al.  Latent structured models for human pose estimation , 2011, 2011 International Conference on Computer Vision.

[40]  Daniel P. Huttenlocher,et al.  Pictorial Structures for Object Recognition , 2004, International Journal of Computer Vision.

[41]  Alan L. Yuille,et al.  Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations , 2014, NIPS.

[42]  Stefan Carlsson,et al.  3D Pictorial Structures for Multiple View Articulated Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[43]  Vibhav Vineet,et al.  Conditional Random Fields as Recurrent Neural Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44]  Jeff A. Bilmes,et al.  Deep Canonical Correlation Analysis , 2013, ICML.