DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare

We present DenseRaC, a novel end-to-end framework for jointly estimating 3D human pose and body shape from a monocular RGB image. Our two-step framework takes the body pixel-to-surface correspondence map (i.e., IUV map) as proxy representation and then performs estimation of parameterized human pose and shape. Specifically, given an estimated IUV map, we develop a deep neural network optimizing 3D body reconstruction losses and further integrating a render-and-compare scheme to minimize differences between the input and the rendered output, i.e., dense body landmarks, body part masks, and adversarial priors. To boost learning, we further construct a large-scale synthetic dataset (MOCA) utilizing web-crawled Mocap sequences, 3D scans and animations. The generated data covers diversified camera views, human actions and body shapes, and is paired with full ground truth. Our model jointly learns to represent the 3D human body from hybrid datasets, mitigating the problem of unpaired training data. Our experiments show that DenseRaC obtains superior performance against state of the art on public benchmarks of various human-related tasks.

[1]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Xiaowei Zhou,et al.  Sparseness Meets Deepness: 3D Human Pose Estimation from Monocular Video , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Markus H. Gross,et al.  HS-Nets: Estimating Human Body Shape from Silhouettes with Convolutional Neural Networks , 2016, 2016 Fourth International Conference on 3D Vision (3DV).

[5]  Yichen Wei,et al.  Towards 3D Human Pose Estimation in the Wild: A Weakly-Supervised Approach , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[6]  Song-Chun Zhu,et al.  Monocular 3D Human Pose Estimation by Predicting Depth on Joints , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Michael J. Black,et al.  OpenDR: An Approximate Differentiable Renderer , 2014, ECCV.

[8]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[10]  Bernt Schiele,et al.  2D Human Pose Estimation: New Benchmark and State of the Art Analysis , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  David Picard,et al.  2D/3D Pose Estimation and Action Recognition Using Multitask Deep Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Juergen Gall,et al.  A Dual-Source Approach for 3D Pose Estimation from a Single Image , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  James J. Little,et al.  Exploiting Temporal Information for 3D Human Pose Estimation , 2017, ECCV.

[14]  James M. Rehg,et al.  3D-RCNN: Instance-Level 3D Object Reconstruction via Render-and-Compare , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  Daniel Cremers,et al.  DeepWrinkles: Accurate and Realistic Clothing Modeling , 2018, ECCV.

[16]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Cristian Sminchisescu,et al.  Latent structured models for human pose estimation , 2011, 2011 International Conference on Computer Vision.

[18]  Yichen Wei,et al.  Compositional Human Pose Regression , 2018, Comput. Vis. Image Underst..

[19]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20]  Trevor Darrell,et al.  Fast pose estimation with parameter-sensitive hashing , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[21]  Pascal Fua,et al.  Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation , 2018, ECCV.

[22]  Peter Shirley,et al.  Fundamentals of computer graphics , 2018 .

[23]  Michael J. Black,et al.  The stitched puppet: A graphical model of 3D human shape and pose , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Antoni B. Chan,et al.  Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[26]  Geoffrey E. Hinton,et al.  The Helmholtz Machine , 1995, Neural Computation.

[27]  Ankur Agarwal,et al.  Recovering 3D human pose from monocular images , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Pascal Fua,et al.  Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision , 2016, 2017 International Conference on 3D Vision (3DV).

[29]  Takashi Matsuyama,et al.  3D Video and Its Applications , 2012, Springer London.

[30]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Tatsuya Harada,et al.  Neural 3D Mesh Renderer , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Takashi Matsuyama,et al.  Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[33]  Michael J. Black,et al.  DRAPE , 2012, ACM Trans. Graph..

[34]  Liang Lin,et al.  Human Re-identification by Matching Compositional Template with Cluster Sampling , 2013, 2013 IEEE International Conference on Computer Vision.

[35]  Cewu Lu,et al.  RMPE: Regional Multi-person Pose Estimation , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[36]  Michael J. Black,et al.  MoSh: motion and shape capture from sparse markers , 2014, ACM Trans. Graph..

[37]  Dario Pavllo,et al.  3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Ignas Budvytis,et al.  Indirect deep structured learning for 3D human body shape and pose prediction , 2017, BMVC.

[39]  Xiaogang Wang,et al.  3D Human Pose Estimation in the Wild by Adversarial Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[40]  Song-Chun Zhu,et al.  Learning Pose Grammar to Encode Human Body Configuration for 3D Pose Estimation , 2017, AAAI.

[41]  James J. Little,et al.  A Simple Yet Effective Baseline for 3d Human Pose Estimation , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[42]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[43]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[44]  Mark Everingham,et al.  Learning effective human pose estimation from inaccurate annotation , 2011, CVPR 2011.

[45]  Pascal Fua,et al.  Learning Monocular 3D Human Pose Estimation from Multi-view Images , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Cordelia Schmid,et al.  MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild , 2016, NIPS.

[47]  Song-Chun Zhu,et al.  A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Vincent Lepetit,et al.  Structured Prediction of 3D Human Pose with Deep Neural Networks , 2016, BMVC.

[49]  Iasonas Kokkinos,et al.  DenseReg: Fully Convolutional Dense Shape Regression In-the-Wild , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Tao Yuan,et al.  Scene-Centric Joint Parsing of Cross-View Videos , 2017, AAAI.

[51]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[52]  Bernt Schiele,et al.  Articulated people detection and pose estimation: Reshaping the future , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[53]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[54]  Takeo Kanade,et al.  Learning scene-specific pedestrian detectors without real data , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Ajmal Mian,et al.  3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[57]  Song-Chun Zhu,et al.  Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[58]  Hans-Peter Seidel,et al.  VNect , 2017, ACM Trans. Graph..

[59]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Qing Wu,et al.  Shape-from-Mask: A Deep Learning Based Human Body Shape Reconstruction from Binary Mask Images , 2018, ArXiv.

[62]  Peter V. Gehler,et al.  Unite the People: Closing the Loop Between 3D and 2D Human Representations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Cordelia Schmid,et al.  BodyNet: Volumetric Inference of 3D Human Body Shapes , 2018, ECCV.