Learning Markerless Human Pose Estimation from Multiple Viewpoint Video

We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple view-point video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from a MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces enabling regression and so estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.

[1]  Adrian Hilton,et al.  Shape Similarity for 3D Video Sequences of People , 2010, International Journal of Computer Vision.

[2]  Trevor Darrell,et al.  A Bayesian approach to image-based visual hull reconstruction , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[3]  Neil D. Lawrence,et al.  Probabilistic Non-linear Principal Component Analysis with Gaussian Process Latent Variable Models , 2005, J. Mach. Learn. Res..

[4]  Vincent Lepetit,et al.  Learning descriptors for object recognition and 3D pose estimation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Vittorio Ferrari,et al.  Better Appearance Models for Pictorial Structures , 2009, BMVC.

[6]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[7]  Deva Ramanan,et al.  Articulated pose estimation with tiny synthetic videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[8]  Kostas Daniilidis,et al.  Spherical Correlation of Visual Representations for 3D Model Retrieval , 2009, International Journal of Computer Vision.

[9]  Christian Szegedy,et al.  DeepPose: Human Pose Estimation via Deep Neural Networks , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[10]  Paul A. Viola,et al.  Robust Real-time Object Detection , 2001 .

[11]  Shamik Sural,et al.  Object Tracking Using Background Subtraction and Motion Estimation in MPEG Videos , 2006, ACCV.

[12]  Ramakant Nevatia,et al.  Bayesian human segmentation in crowded situations , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[13]  Qi-Xing Huang,et al.  Dense Human Body Correspondences Using Convolutional Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[15]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Adrian Hilton,et al.  Hybrid Skeletal-Surface Motion Graphs for Character Animation from 4D Performance Capture , 2015, TOGS.

[17]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[18]  Jonathan Tompson,et al.  Efficient ConvNet-based marker-less motion capture in general scenes with a low number of cameras , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).