POEM: Reconstructing Hand in a Point Embedded Multi-view Stereo

Enable neural networks to capture 3D geometrical-aware features is essential in multi-view based vision tasks. Previous methods usually encode the 3D information of multi-view stereo into the 2D features. In contrast, we present a novel method, named POEM, that directly operates on the 3D POints Embedded in the Multi-view stereo for reconstructing hand mesh in it. Point is a natural form of 3D information and an ideal medium for fusing features across views, as it has different projections on different views. Our method is thus in light of a simple yet effective idea, that a complex 3D hand mesh can be represented by a set of 3D points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encircle the hand. To leverage the power of points, we design two operations: point-based feature fusion and cross-set point attention mechanism. Evaluation on three challenging multi-view datasets shows that POEM outperforms the state-of-the-art in hand mesh reconstruction. Code and models are available for research at https://github.com/lixiny/POEM.

[1]  Robert Y. Wang,et al.  UmeTrack: Unified multi-view end-to-end hand tracking for VR , 2022, SIGGRAPH Asia.

[2]  Linguang Zhang,et al.  Identity-Aware Hand Mesh Estimation and Personalization from RGB Images , 2022, ECCV.

[3]  C. Schmid,et al.  AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction , 2022, ECCV.

[4]  Cewu Lu,et al.  OakInk: A Large-scale Knowledge Repository for Understanding Hand-Object Interaction , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Kyoung Mu Lee,et al.  HandOccNet: Occlusion-Robust 3D Hand Mesh Estimation Network , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Jian Sun,et al.  PETR: Position Embedding Transformation for Multi-View 3D Object Detection , 2022, ECCV.

[7]  Shuicheng Yan,et al.  Direct Multi-view Multi-person 3D Pose Estimation , 2021, NeurIPS.

[8]  Xiaolong Wang,et al.  DexMV: Imitation Learning for Dexterous Manipulation from Human Videos , 2021, ECCV.

[9]  Vincent Lepetit,et al.  HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset , 2021, ArXiv.

[10]  V. Lepetit,et al.  Keypoint Transformer: Solving Joint Identification in Challenging Hands and Object Interactions for Accurate 3D Pose Estimation , 2021, Computer Vision and Pattern Recognition.

[11]  Yashraj S. Narang,et al.  DexYCB: A Benchmark for Capturing Hand Grasping of Objects , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jiahao Lin,et al.  Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Lijuan Wang,et al.  Mesh Graphormer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[14]  Yufeng Liu,et al.  Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Qin Li,et al.  I2UV-HandNet: Image-to-UV Prediction Network for Accurate and High-fidelity 3D Hand Mesh Modeling , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[16]  Kevin Lin,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ralph R. Martin,et al.  PCT: Point cloud transformer , 2020, Computational Visual Media.

[18]  Cewu Lu,et al.  CPF: Learning a Contact Potential Field to Model the Hand-Object Interaction , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[19]  Klaus Dietmayer,et al.  Point Transformer , 2020, IEEE Access.

[20]  Yan Zhang,et al.  Grasping Field: Learning Implicit Representations for Human Grasps , 2020, 2020 International Conference on 3D Vision (3DV).

[21]  Kyoung Mu Lee,et al.  I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image , 2020, ECCV.

[22]  Charles C. Kemp,et al.  ContactPose: A Dataset of Grasps with Object Contact and Hand Pose , 2020, ECCV.

[23]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[24]  Katerina Fragkiadaki,et al.  Epipolar Transformers , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Vladlen Koltun,et al.  Exploring Self-Attention for Image Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Wenjun Zeng,et al.  VoxelPose: Towards Multi-camera 3D Human Pose Estimation in Wild Environment , 2020, ECCV.

[27]  Pascal Fua,et al.  Lightweight Multi-View 3D Pose Estimation Through Camera-Disentangled Representation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  C. Theobalt,et al.  Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[31]  Christoph Lassner,et al.  Efficient Learning on Point Clouds With Basis Point Sets , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Jing Xu,et al.  Point-Based Multi-View Stereo Network , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[33]  Wenjun Zeng,et al.  Cross View Fusion for 3D Human Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  V. Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Victor Lempitsky,et al.  Learnable Triangulation of Human Pose , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Kostas Daniilidis,et al.  Convolutional Mesh Regression for Single-Image Human Shape Reconstruction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Leonidas J. Guibas,et al.  KPConv: Flexible and Deformable Convolution for Point Clouds , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[38]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Junsong Yuan,et al.  3D Hand Shape and Pose Estimation From a Single RGB Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Wei Wu,et al.  PointCNN: Convolution On X-Transformed Points , 2018, NeurIPS.

[42]  Fuxin Li,et al.  PointConv: Deep Convolutional Networks on 3D Point Clouds , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Long Quan,et al.  MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[44]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[45]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Lu Fang,et al.  SurfaceNet: An End-to-End 3D Neural Network for Multiview Stereopsis , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Jitendra Malik,et al.  Learning a Multi-View Stereo Machine , 2017, NIPS.

[48]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[49]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[50]  Xiaowei Zhou,et al.  Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Varun Ramakrishna,et al.  Convolutional Pose Machines , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[55]  Dimitrios Tzionas,et al.  Embodied hands , 2017, ACM Trans. Graph..

[56]  Bernhard P. Wrobel,et al.  Multiple View Geometry in Computer Vision , 2001 .