Coupled multiview autoencoders with locality sensitivity for three-dimensional human pose estimation

Abstract. Estimating three-dimensional (3D) human poses from a single camera is usually implemented by searching pose candidates with image descriptors. Existing methods usually assume that the mapping from feature space to pose space is linear. In fact this mapping is highly nonlinear, and the mismatch heavily degrades the performance of 3D pose estimation. We propose a method to recover the 3D pose from a silhouette image, based on multiview feature embedding (MFE) and locality-sensitive autoencoders (LSAEs). First, we formulate a manifold-regularized sparse low-rank approximation for MFE, which characterizes the input image by a fused feature descriptor. Second, the fused feature and its corresponding 3D pose are encoded separately by LSAEs. A two-layer back-propagation neural network is then trained by parameter fine-tuning and used to map the encoded 2D features to the encoded 3D poses. The LSAE preserves the local topology of the data points. Experimental results demonstrate the effectiveness of the proposed method.
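The abstract does not state the LSAE objective, so the following is only an illustrative sketch of what "preserving local topology" could look like as a penalty term, not the authors' formulation. It assumes a k-nearest-neighbour graph built in input space and sums the squared distances between the codes of neighbouring points. The function names `knn_indices` and `locality_penalty`, the choice of Euclidean distance, and the parameter `k` are all hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, return the indices of its k nearest neighbours.

    Distances are Euclidean; ties are broken by the lower index.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result


def locality_penalty(inputs, codes, k=2):
    """Assumed locality term: squared code distance summed over k-NN pairs.

    Neighbours are found in input space, distances are measured in code
    space, so the value is small when nearby inputs get nearby codes.
    """
    total = 0.0
    for i, nbrs in enumerate(knn_indices(inputs, k)):
        for j in nbrs:
            total += math.dist(codes[i], codes[j]) ** 2
    return total


# Toy data: four 1-D inputs, with the last one far from the others.
inputs = [(0.0,), (1.0,), (2.0,), (10.0,)]
faithful = inputs                                # codes keep the layout
scrambled = [(0.0,), (10.0,), (1.0,), (2.0,)]    # codes break the layout

print(locality_penalty(inputs, faithful, k=1))
print(locality_penalty(inputs, scrambled, k=1))
```

A penalty of this kind is zero when every point collapses to the same code, so in a sketch like this it would only make sense when added to a reconstruction error rather than used alone.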
