Learning Camera Viewpoint Using CNN to Improve 3D Body Pose Estimation

The objective of this work is to estimate 3D human pose from a single RGB image. Extracting image representations which incorporate both spatial relation of body parts and their relative depth plays an essential role in accurate3D pose reconstruction. In this paper, for the first time, we show that camera viewpoint in combination to 2D joint locations significantly improves 3D pose accuracy without the explicit use of perspective geometry mathematical models. To this end, we train a deep Convolutional Neural Net-work (CNN) to learn categorical camera viewpoint. To make the network robust against clothing and body shape of the subject in the image, we utilized 3D computer rendering to synthesize additional training images. We test our framework on the largest 3D pose estimation bench-mark, Human3.6m, and achieve up to 20% error reduction on standing-pose activities compared to the state-of-the-art approaches that do not use body part segmentation.

[1]  Cristian Sminchisescu,et al.  Semantic Segmentation with Second-Order Pooling , 2012, ECCV.

[2]  T. Kanade,et al.  Reconstructing 3 D Human Pose from 2 D Image Landmarks , 2012 .

[3]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Antoni B. Chan,et al.  3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network , 2014, ACCV.

[5]  Francesc Moreno-Noguer,et al.  A Joint Model for 2D and 3D Pose Estimation from a Single Image , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Vincent Lepetit,et al.  Direct Prediction of 3D Body Poses from Motion Compensated Sequences , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[9]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[10]  Christian Vollmer,et al.  Estimation of human upper body orientation for mobile robotics using an SVM decision tree on monocular images , 2012, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[11]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[12]  Ilya Kostrikov,et al.  Depth Sweep Regression Forests for Estimating 3D Human Pose from Images , 2014, BMVC.

[13]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[14]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[15]  Youjie Zhou,et al.  Pose Locality Constrained Representation for 3D Human Pose Reconstruction , 2014, ECCV.

[16]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Andrew Zisserman,et al.  Return of the Devil in the Details: Delving Deep into Convolutional Nets , 2014, BMVC.

[18]  Simone Re,et al.  Ideas and methods for modeling 3D human figures: the principal algorithms used by MakeHuman and their implementation in a new approach to parametric modeling , 2008, Bangalore Compute Conf..

[19]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[20]  T. Kanade,et al.  Reconstructing 3D Human Pose from 2D Image Landmarks , 2012, ECCV.

[21]  Leonidas J. Guibas,et al.  Render for CNN: Viewpoint Estimation in Images Using CNNs Trained with Rendered 3D Model Views , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[22]  Cristian Sminchisescu,et al.  Iterated Second-Order Label Sensitive Pooling for 3D Human Pose Estimation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Bastian Leibe,et al.  Person Attribute Recognition with a Jointly-Trained Holistic CNN Model , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[24]  Zhenhua Wang,et al.  Synthesizing Training Images for Boosting Human 3D Pose Estimation , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[25]  Jitendra Malik,et al.  Human Pose Estimation with Iterative Error Feedback , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Michael J. Black,et al.  HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion , 2010, International Journal of Computer Vision.

[27]  Juergen Gall,et al.  A Dual-Source Approach for 3D Pose Estimation from a Single Image , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Alexei A. Efros,et al.  Seeing 3D Chairs: Exemplar Part-Based 2D-3D Alignment Using a Large Dataset of CAD Models , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Thomas Brox,et al.  Learning to Generate Chairs, Tables and Cars with Convolutional Networks , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Michael J. Black,et al.  Pose-conditioned joint angle limits for 3D human pose reconstruction , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Roland Göcke,et al.  Monocular Image 3D Human Pose Estimation under Self-Occlusion , 2013, 2013 IEEE International Conference on Computer Vision.