VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds

3D human mesh recovery from point clouds is essential for various tasks, including AR/VR and human behavior understanding. Previous works in this field either require high-quality 3D human scans or sequential point clouds, which cannot be easily applied to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we make the first attempt to reconstruct reliable 3D human shapes from single-frame partial point clouds. To achieve this, we propose an end-to-end learnable method, named VoteHMR. The core of VoteHMR is a novel occlusion-aware voting network that can first reliably produce visible joint-level features from the input partial point clouds, and then complete the joint-level features through the kinematic tree of the human skeleton. Compared with holistic features used by previous works, the joint-level features can not only effectively encode the human geometry information but also be robust to noisy inputs with self-occlusions and missing areas. By exploiting the rich complementary clues from the joint-level features and global features from the input point clouds, the proposed method encourages reliable and disentangled parameter predictions for statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art performances on two large-scale datasets, namely SURREAL and DFAUST. Furthermore, VoteHMR also demonstrates superior generalization ability on real-world datasets, such as Berkeley MHAD.

[1]  Dimitrios Tzionas,et al.  Expressive Body Capture: 3D Hands, Face, and Body From a Single Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Tao Yu,et al.  HybridFusion: Real-Time Performance Capture Using a Single Depth Sensor and Sparse IMUs , 2018, ECCV.

[3]  Michael J. Black,et al.  Dynamic FAUST: Registering Human Bodies in Motion , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Bo Dai,et al.  Scene-aware Generative Network for Human Motion Synthesis , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Dragomir Anguelov,et al.  SCAPE: shape completion and animation of people , 2005, ACM Trans. Graph..

[6]  Wei Sun,et al.  PVN3D: A Deep Point-Wise 3D Keypoints Voting Network for 6DoF Pose Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Hao Li,et al.  PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[9]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Takaaki Shiratori,et al.  FrankMocap: Fast Monocular 3D Hand and Body Motion Capture by Regression and Integration , 2020, ArXiv.

[11]  Li Jiang,et al.  PointGroup: Dual-Set Point Grouping for 3D Instance Segmentation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Mathieu Aubry,et al.  3D-CODED: 3D Correspondences by Deep Deformation , 2018, ECCV.

[13]  Zhenan Sun,et al.  Learning 3D Human Shape and Pose From Dense Body Parts , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Qionghai Dai,et al.  Performance Capture of Interacting Characters with Handheld Kinects , 2012, ECCV.

[15]  Guofeng Zhang,et al.  Sequential 3D Human Pose and Shape Estimation From Point Clouds , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  P.V.C. Hough,et al.  Machine Analysis of Bubble Chamber Pictures , 1959 .

[18]  Sergey Levine,et al.  AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control , 2021, ACM Trans. Graph..

[19]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[20]  Leonidas J. Guibas,et al.  Deep Hough Voting for 3D Object Detection in Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[21]  Dong Xu,et al.  Back-tracing Representative Points for Voting-based 3D Object Detection in Point Clouds , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Sebastian Thrun,et al.  SCAPE: shape completion and animation of people , 2005, SIGGRAPH 2005.

[23]  Tao Yu,et al.  BodyFusion: Real-Time Capture of Human Motion and Surface Geometry Using a Single Depth Camera , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[24]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[25]  Ruzena Bajcsy,et al.  Berkeley MHAD: A comprehensive Multimodal Human Action Database , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[26]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Qionghai Dai,et al.  DoubleFusion: Real-Time Capture of Human Performances with Inner Body Shapes from a Single Depth Sensor , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Matthias Nießner,et al.  VolumeDeform: Real-Time Volumetric Non-rigid Reconstruction , 2016, ECCV.

[31]  Tao Yu,et al.  DeepHuman: 3D Human Reconstruction From a Single Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Junsong Yuan,et al.  Point-to-Point Regression PointNet for 3D Hand Pose Estimation , 2018, ECCV.

[33]  Peter V. Gehler,et al.  Neural Body Fitting: Unifying Deep Learning and Model Based Human Pose and Shape Estimation , 2018, 2018 International Conference on 3D Vision (3DV).

[34]  Iasonas Kokkinos,et al.  HoloPose: Holistic 3D Human Reconstruction In-The-Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Cordelia Schmid,et al.  BodyNet: Volumetric Inference of 3D Human Body Shapes , 2018, ECCV.

[36]  Jianfei Cai,et al.  Skeleton-Aware 3D Human Shape Reconstruction From Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Edmond Boyer,et al.  FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[38]  Xiaowei Zhou,et al.  Coarse-to-Fine Volumetric Prediction for Single-Image 3D Human Pose , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Hujun Bao,et al.  PVNet: Pixel-Wise Voting Network for 6DoF Pose Estimation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Dieter Fox,et al.  DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Jun Wang,et al.  MLCVNet: Multi-Level Context VoteNet for 3D Object Detection , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Peter V. Gehler,et al.  Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image , 2016, ECCV.

[43]  Andrew W. Fitzgibbon,et al.  3D scanning deformable objects with a single RGBD sensor , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Dahua Lin,et al.  Motion Guided 3D Pose Estimation from Videos , 2020, ECCV.

[45]  Ruigang Yang,et al.  Detailed Human Shape Estimation From a Single Image by Hierarchical Mesh Deformation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46]  Cheng Li,et al.  Delving Deep Into Hybrid Annotations for 3D Human Recovery in the Wild , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[47]  Iasonas Kokkinos,et al.  DensePose: Dense Human Pose Estimation in the Wild , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Michael J. Black,et al.  SMPL: A Skinned Multi-Person Linear Model , 2023 .

[49]  Michael J. Black,et al.  STAR: Sparse Trained Articulated Human Body Regressor , 2020, ECCV.

[50]  Cordelia Schmid,et al.  Learning from Synthetic Humans , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Jitendra Malik,et al.  Learning 3D Human Dynamics From Video , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Yue Wang,et al.  Dynamic Graph CNN for Learning on Point Clouds , 2018, ACM Trans. Graph..

[53]  Qi-Xing Huang,et al.  Dense Human Body Correspondences Using Convolutional Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Sergey Levine,et al.  Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow , 2018, ICLR.

[55]  Chen Change Loy,et al.  Chasing the Tail in Monocular 3D Human Reconstruction With Prototype Memory , 2020, IEEE Transactions on Image Processing.

[56]  Michael J. Black,et al.  VIBE: Video Inference for Human Body Pose and Shape Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[58]  Ruigang Yang,et al.  Real-Time Simultaneous Pose and Shape Estimation for Articulated Objects Using a Single Depth Camera , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[59]  Michael J. Black,et al.  Detailed Full-Body Reconstructions of Moving People from Monocular RGB-D Sequences , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[60]  Pushmeet Kohli,et al.  Fusion4D , 2016, ACM Trans. Graph..