Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps

Recovering 3D human pose from a monocular camera is an inherently ill-posed problem, because many different 3D poses can project to the same 2D image. To improve the accuracy of 3D motion reconstruction, we introduce additional built-in knowledge, namely a height-map, into the algorithmic scheme for reconstructing 3D pose and motion from a single calibrated camera view. The proposed framework makes two major contributions. First, the RGB image and its computed height-map are combined to detect 2D joint landmarks with a dual-stream deep convolutional network. Second, we formulate a new objective function that estimates 3D motion from the 2D joints detected in the monocular image sequence and enforces temporal coherence constraints on both the camera and the 3D poses. Experiments on the HumanEva, Human3.6M, and MCAD datasets show that our method outperforms state-of-the-art algorithms in both 2D joint localization and 3D motion recovery. Moreover, the evaluation results on HumanEva indicate that the performance of the proposed single-view approach is comparable to that of its multi-view deep learning counterpart.
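The ambiguity described above, and the way a temporal coherence term can help resolve it, can be illustrated with a minimal sketch. This is not the paper's objective function: the pinhole intrinsics (`focal`, `cx`, `cy`), the single tracked joint, and the two cost terms `reprojection_error` and `temporal_smoothness` are all assumptions chosen for illustration.

```python
def project(point, focal=1000.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (X, Y, Z) to a pixel (u, v).

    The intrinsics are hypothetical values, not taken from the paper.
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (Z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def reprojection_error(poses, observations):
    """Sum of squared pixel distances between projected points and 2D detections."""
    total = 0.0
    for point, (u, v) in zip(poses, observations):
        pu, pv = project(point)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


def temporal_smoothness(poses):
    """Sum of squared frame-to-frame 3D displacements (an illustrative coherence term)."""
    return sum(
        sum((b - a) ** 2 for a, b in zip(prev, curr))
        for prev, curr in zip(poses, poses[1:])
    )


# One joint over three frames, in camera coordinates (metres).
smooth = [(0.20, -0.50, 3.0), (0.21, -0.50, 3.0), (0.22, -0.50, 3.0)]

# Same viewing rays, but the middle frame is pushed to twice the depth.
jittery = [smooth[0], tuple(2.0 * c for c in smooth[1]), smooth[2]]

# Treat the projections of the smooth sequence as the detected 2D joints.
observations = [project(p) for p in smooth]

# Both sequences explain the 2D detections (up to floating-point rounding) ...
print(reprojection_error(smooth, observations))
print(reprojection_error(jittery, observations))

# ... but only the temporal term tells them apart.
print(temporal_smoothness(smooth) < temporal_smoothness(jittery))
```

In this sketch the reprojection term alone cannot distinguish the two sequences, since scaling a point along its viewing ray leaves its pixel location unchanged; adding a penalty on frame-to-frame motion favours the sequence without the depth jump.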
