Reconstructing Hand-Held Objects from Monocular Video

This paper presents an approach that reconstructs a hand-held object from a monocular video. In contrast to many recent methods that directly predict object geometry by a trained network, the proposed approach does not require any learned prior about the object and is able to recover more accurate and detailed object geometry. The key idea is that the hand motion naturally provides multiple views of the object and the motion can be reliably estimated by a hand pose tracker. Then, the object geometry can be recovered by solving a multi-view reconstruction problem. We devise an implicit neural representation-based method to solve the reconstruction problem and address the issues of imprecise hand pose estimation, relative hand-object motion, and insufficient geometry optimization for small objects. We also provide a newly collected dataset with 3D ground truth to validate the proposed approach. The dataset and code will be released at https://dihuangdh.github.io/hhor.

[1]  A. Gupta,et al.  What's in your hands? 3D Reconstruction of Generic Objects in Hands , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Andreas Geiger,et al.  TensoRF: Tensorial Radiance Fields , 2022, ECCV.

[3]  T. Müller,et al.  Instant neural graphics primitives with a multiresolution hash encoding , 2022, ACM Trans. Graph..

[4]  Jun-Hai Yong,et al.  Single Depth View Based Real-Time Reconstruction of Hand-Object Interactions , 2021, ACM Trans. Graph..

[5]  C. Theobalt,et al.  NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction , 2021, NeurIPS.

[6]  Antonio Torralba,et al.  BARF: Bundle-Adjusting Neural Radiance Fields , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[7]  Stefan Leutenegger,et al.  In-Place Scene Labelling and Understanding with Implicit Scene Representation , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  V. Prisacariu,et al.  NeRF--: Neural Radiance Fields Without Known Camera Parameters , 2021, 2102.07064.

[9]  M. Zollhöfer,et al.  Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Kevin Lin,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Ira Kemelmacher-Shlizerman,et al.  Real-Time High-Resolution Background Matting , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Jonathan T. Barron,et al.  Nerfies: Deformable Neural Radiance Fields , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Gerard Pons-Moll,et al.  Neural Unsigned Distance Fields for Implicit Function Learning , 2020, NeurIPS.

[14]  Takaaki Shiratori,et al.  InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image , 2020, ECCV.

[15]  Yan Zhang,et al.  Grasping Field: Learning Implicit Representations for Human Grasps , 2020, 2020 International Conference on 3D Vision (3DV).

[16]  Charles C. Kemp,et al.  ContactPose: A Dataset of Grasps with Object Contact and Hand Pose , 2020, ECCV.

[17]  Chengde Wan,et al.  MEgATrack , 2020, ACM Trans. Graph..

[18]  Christopher D. Twigg,et al.  MEgATrack , 2020 .

[19]  Tony Martinez,et al.  Two-hand Global 3D Pose Estimation using Monocular RGB , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[20]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Pratul P. Srinivasan,et al.  NeRF , 2020, ECCV.

[22]  Y. Lipman,et al.  Implicit Geometric Regularization for Learning Shapes , 2020, ICML.

[23]  Kai Xu,et al.  Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[25]  Michael J. Black,et al.  Learning to Reconstruct 3D Human Pose and Shape via Model-Fitting in the Loop , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[27]  Miguel A. Otaduy,et al.  Real-time pose and shape reconstruction of two interacting hands with a single depth camera , 2019, ACM Trans. Graph..

[28]  Feng Xu,et al.  InteractionFusion: real-time reconstruction of hand poses and deformable objects in hand-object interactions , 2019, ACM Trans. Graph..

[29]  V. Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Duygu Ceylan,et al.  DISN: Deep Implicit Surface Network for High-quality Single-view 3D Reconstruction , 2019, NeurIPS.

[31]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Junsong Yuan,et al.  3D Hand Shape and Pose Estimation From a Single RGB Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Richard A. Newcombe,et al.  DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Leonidas J. Guibas,et al.  Normalized Object Coordinate Space for Category-Level 6D Object Pose and Size Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Sebastian Nowozin,et al.  Occupancy Networks: Learning 3D Reconstruction in Function Space , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Hao Zhang,et al.  Learning Implicit Fields for Generative Shape Modeling , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Alexey Dosovitskiy,et al.  Unsupervised Learning of Shape and Pose with Differentiable Point Clouds , 2018, NeurIPS.

[40]  Xiaojuan Qi,et al.  GAL: Geometric Adversarial Loss for Single-View 3D-Object Reconstruction , 2018, ECCV.

[41]  R. Venkatesh Babu,et al.  3D-LMNet: Latent Embedding Matching for Accurate and Diverse 3D Point Cloud Reconstruction from a Single Image , 2018, BMVC.

[42]  Long Quan,et al.  MVSNet: Depth Inference for Unstructured Multi-view Stereo , 2018, ECCV.

[43]  Wei Liu,et al.  Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images , 2018, ECCV.

[44]  Mathieu Aubry,et al.  A Papier-Mache Approach to Learning 3D Surface Generation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  George Papandreou,et al.  Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation , 2018, ECCV.

[46]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[48]  Michael J. Black,et al.  Embodied Hands : Modeling and Capturing Hands and Bodies Together * * Supplementary Material * * , 2017 .

[49]  Chao Yang,et al.  Shape Inpainting Using 3D Generative Adversarial Network and Recurrent Convolutional Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[50]  Ulrich Schwanecke,et al.  Real-Time Monocular Pose Estimation of 3D Objects Using Temporally Consistent Local Color Histograms , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[51]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[52]  Yaser Sheikh,et al.  Hand Keypoint Detection in Single Images Using Multiview Bootstrapping , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Thomas Brox,et al.  Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[54]  Hao Su,et al.  A Point Set Generation Network for 3D Object Reconstruction from a Single Image , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Matthias Nießner,et al.  Shape Completion Using 3D-Encoder-Predictor CNNs and Shape Synthesis , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Gernot Riegler,et al.  OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Jan-Michael Frahm,et al.  Pixelwise View Selection for Unstructured Multi-View Stereo , 2016, ECCV.

[58]  Ulrich Schwanecke,et al.  Real-Time Monocular Segmentation and Pose Tracking of Multiple Objects , 2016, ECCV.

[59]  Andrew W. Fitzgibbon,et al.  Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences , 2016, ACM Trans. Graph..

[60]  Jan-Michael Frahm,et al.  Structure-from-Motion Revisited , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Silvio Savarese,et al.  3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction , 2016, ECCV.

[62]  Leonidas J. Guibas,et al.  ShapeNet: An Information-Rich 3D Model Repository , 2015, ArXiv.

[63]  Dimitrios Tzionas,et al.  3D Object Reconstruction from Hand-Object Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  Ken Perlin,et al.  Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks , 2014, ACM Trans. Graph..

[65]  Andrew W. Fitzgibbon,et al.  KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera , 2011, UIST.

[66]  Jean Ponce,et al.  Accurate, Dense, and Robust Multiview Stereopsis , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  Luc Van Gool,et al.  Accurate and robust registration for in-hand modeling , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[68]  Michael M. Kazhdan,et al.  Poisson surface reconstruction , 2006, SGP '06.

[69]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[70]  Marc Levoy,et al.  Real-time 3D model acquisition , 2002, ACM Trans. Graph..

[71]  Kiriakos N. Kutulakos,et al.  A Theory of Shape by Space Carving , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[72]  William E. Lorensen,et al.  Marching cubes: A high resolution 3D surface construction algorithm , 1987, SIGGRAPH.

[73]  Dimitrios Tzionas,et al.  Embodied hands , 2017, ACM Trans. Graph..