Paper Information - Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose

This paper addresses the challenge of 3D human pose estimation from a single color image. Despite the general success of the end-to-end learning paradigm, top-performing approaches employ a two-step solution consisting of a Convolutional Network (ConvNet) for 2D joint localization and a subsequent optimization step to recover 3D pose. In this paper, we identify the representation of 3D pose as a critical issue with current ConvNet approaches and make two important contributions towards validating the value of end-to-end learning for this task. First, we propose a fine discretization of the 3D space around the subject and train a ConvNet to predict per-voxel likelihoods for each joint. This creates a natural representation for 3D pose and greatly improves performance over the direct regression of joint coordinates. Second, to further improve upon initial estimates, we employ a coarse-to-fine prediction scheme. This step addresses the large dimensionality increase and enables iterative refinement and repeated processing of the image features. The proposed approach outperforms all state-of-the-art methods on standard benchmarks, achieving a relative error reduction greater than 30% on average. Additionally, we investigate using our volumetric representation in a related architecture, which is suboptimal compared to our end-to-end approach but is of practical interest, since it enables training when no image with corresponding 3D groundtruth is available and allows us to present compelling results for in-the-wild images.
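To make the volumetric representation concrete, the following minimal sketch discretizes a cube around the subject, encodes one joint as a per-voxel score volume, and decodes it again by taking the highest-scoring voxel. It is an illustration only, not the authors' implementation: the cube size (2000 mm), the grid resolution (32), the Gaussian-shaped target, the depth-pooling step used to mimic a coarser stage, and all function names are assumptions that do not come from the abstract.

```python
import math

def metric_to_voxel(coord_mm, extent_mm, size):
    """Map a root-relative 3D point (mm) to integer voxel indices (assumed cube of side extent_mm)."""
    voxel = extent_mm / size
    return tuple(min(size - 1, max(0, int(math.floor((c + extent_mm / 2) / voxel))))
                 for c in coord_mm)

def voxel_to_metric(index, extent_mm, size):
    """Return the metric center (mm) of a voxel."""
    voxel = extent_mm / size
    return tuple((i + 0.5) * voxel - extent_mm / 2 for i in index)

def joint_to_volume(index, size, sigma=1.0):
    """Build a size^3 score volume: an unnormalized Gaussian peaked at `index` (assumed target shape)."""
    ix, iy, iz = index
    return [[[math.exp(-((x - ix) ** 2 + (y - iy) ** 2 + (z - iz) ** 2) / (2 * sigma ** 2))
              for z in range(size)]
             for y in range(size)]
            for x in range(size)]

def volume_to_joint(volume):
    """Decode a joint as the index of the highest-scoring voxel."""
    best, best_idx = float("-inf"), None
    for x, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for z, score in enumerate(row):
                if score > best:
                    best, best_idx = score, (x, y, z)
    return best_idx

def coarsen_depth(volume, factor):
    """Max-pool along the depth (z) axis to get a lower-resolution volume (hypothetical coarse stage)."""
    return [[[max(row[k:k + factor]) for k in range(0, len(row), factor)]
             for row in plane]
            for plane in volume]

# Hypothetical joint position, in mm relative to the root joint.
EXTENT_MM, SIZE = 2000.0, 32
joint_mm = (120.0, -340.0, 55.0)

joint_idx = metric_to_voxel(joint_mm, EXTENT_MM, SIZE)
volume = joint_to_volume(joint_idx, SIZE)
decoded_idx = volume_to_joint(volume)
decoded_mm = voxel_to_metric(decoded_idx, EXTENT_MM, SIZE)
coarse_idx = volume_to_joint(coarsen_depth(volume, 4))
```

Decoding from voxel centers introduces a quantization error of at most half a voxel per axis (31.25 mm with the values assumed here), which is one reason a fine discretization matters. The cost is that the output grows cubically with resolution; predicting a volume with coarser depth first, as `coarsen_depth` imitates, is one way to keep early predictions low-dimensional before refining them.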
