Multimodal Deep Autoencoder for Human Pose Recovery

Video-based human pose recovery is usually conducted by retrieving relevant poses using image features. In the retrieving process, the mapping between 2D images and 3D poses is assumed to be linear in most of the traditional methods. However, their relationships are inherently non-linear, which limits recovery performance of these methods. In this paper, we propose a novel pose recovery method using non-linear mapping with multi-layered deep neural network. It is based on feature extraction with multimodal fusion and back-propagation deep learning. In multimodal fusion, we construct hypergraph Laplacian with low-rank representation. In this way, we obtain a unified feature description by standard eigen-decomposition of the hypergraph Laplacian matrix. In back-propagation deep learning, we learn a non-linear mapping from 2D images to 3D poses with parameter fine-tuning. The experimental results on three data sets show that the recovery error has been reduced by 20%-25%, which demonstrates the effectiveness of the proposed method.

[1]  Ben Taskar,et al.  MODEC: Multimodal Decomposable Models for Human Pose Estimation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Jia Chen,et al.  Unified Structured Learning for Simultaneous Human Pose Estimation and Garment Attribute Classification , 2014, IEEE Transactions on Image Processing.

[3]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Jitendra Malik,et al.  Shape matching and object recognition using shape contexts , 2010, 2010 3rd International Conference on Computer Science and Information Technology.

[5]  Matthew Brand,et al.  Shadow puppetry , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[6]  Yong Liu,et al.  Latent Gaussian Mixture Regression for Human Pose Estimation , 2010, ACCV.

[7]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[8]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Jitendra Malik,et al.  Estimating Human Body Configurations Using Shape Context Matching , 2002, ECCV.

[10]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  David J. Fleet,et al.  Stochastic Tracking of 3D Human Figures Using 2D Image Motion , 2000, ECCV.

[12]  Antoni B. Chan,et al.  A Robust Likelihood Function for 3D Human Pose Tracking , 2014, IEEE Transactions on Image Processing.

[13]  Wen Gao,et al.  Robust Estimation of 3D Human Poses from a Single Image , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Yongdong Zhang,et al.  Multiview Spectral Embedding , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[15]  Jitendra Malik,et al.  Efficient shape matching using shape contexts , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Jiwu Huang,et al.  Near-Duplicate Image Recognition and Content-based Image Retrieval using Adaptive Hierarchical Geometric Centroids , 2006, 18th International Conference on Pattern Recognition (ICPR'06).

[17]  Guillermo Sapiro,et al.  Online dictionary learning for sparse coding , 2009, ICML '09.

[18]  Xuelong Li,et al.  Principal Component 2-D Long Short-Term Memory for Font Recognition on Single Chinese Characters , 2016, IEEE Transactions on Cybernetics.

[19]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[20]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[21]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[22]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[23]  Kai-Fu Yang,et al.  Multifeature-Based Surround Inhibition Improves Contour Detection in Natural Images , 2014, IEEE Transactions on Image Processing.

[24]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[25]  Bernhard Schölkopf,et al.  Learning with Hypergraphs: Clustering, Classification, and Embedding , 2006, NIPS.

[26]  Yong Yu,et al.  Robust Subspace Segmentation by Low-Rank Representation , 2010, ICML.

[27]  Ramakant Nevatia,et al.  Human Pose Tracking in Monocular Sequence Using Multilevel Structured Models , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[29]  Shuicheng Yan,et al.  Multi-task low-rank affinity pursuit for image segmentation , 2011, 2011 International Conference on Computer Vision.

[30]  Rómer Rosales,et al.  Inferring body pose without tracking body parts , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[31]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Rasmus Berg Palm,et al.  Prediction as a candidate for learning deep hierarchical models of data , 2012 .

[33]  Yoshua Bengio,et al.  Extracting and composing robust features with denoising autoencoders , 2008, ICML '08.

[34]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[35]  Tao Zhang,et al.  Vision-Based Pose Estimation From Points With Unknown Correspondences , 2014, IEEE Transactions on Image Processing.

[36]  Theo Gevers,et al.  Evaluation of Color Spatio-Temporal Interest Points for Human Action Recognition , 2014, IEEE Transactions on Image Processing.

[37]  Xiaogang Wang,et al.  Multi-source Deep Learning for Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[39]  Yoshua. Bengio,et al.  Learning Deep Architectures for AI , 2007, Found. Trends Mach. Learn..

[40]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[41]  Mingli Song,et al.  Manifold Ranking-Based Matrix Factorization for Saliency Detection , 2016, IEEE Transactions on Neural Networks and Learning Systems.

[42]  William T. Freeman,et al.  Bayesian Reconstruction of 3D Human Motion from Single-Camera Video , 1999, NIPS.

[43]  Zongben Xu,et al.  Enhancing Low-Rank Subspace Clustering by Manifold Regularization , 2014, IEEE Transactions on Image Processing.