SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction

Recent hand-object interaction datasets show limited real object variability and rely on fitting the MANO parametric model to obtain ground-truth hand shapes. To go beyond these limitations and spur further research, we introduce the SHOWMe dataset, which consists of 96 videos annotated with real and detailed hand-object 3D textured meshes. Following recent work, we consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence. This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe. Although simpler, this hypothesis is reasonable for applications where high accuracy and a fine level of detail are required, e.g., object hand-over in human-robot collaboration, object scanning, or manipulation and contact point analysis. Importantly, the rigidity of the hand-object system makes it possible to tackle video-based 3D reconstruction of unknown hand-held objects using a 2-stage pipeline consisting of a rigid registration step followed by a multi-view reconstruction (MVR) part. We carefully evaluate a set of non-trivial baselines for these two stages and show that it is possible to achieve promising object-agnostic 3D hand-object reconstructions by employing an SfM toolbox or a hand pose estimator to recover the rigid transforms, together with off-the-shelf MVR algorithms. However, these methods remain sensitive to the initial camera pose estimates, which may be imprecise due to a lack of texture on the objects or heavy occlusion of the hands, leaving room for improvement in the reconstruction. Code and dataset are available at https://europe.naverlabs.com/research/showme
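To make the registration stage concrete: when per-frame 3D hand keypoints are available from an off-the-shelf hand pose estimator, the rigid transform between any two frames of the (assumed rigid) hand-object system can be recovered with the standard Kabsch/Procrustes algorithm. The sketch below is illustrative only and is not the paper's exact method; the function name and the use of 21 MANO-style keypoints are assumptions.

```python
import numpy as np

def rigid_transform(src, dst):
    """Kabsch algorithm: find rotation R and translation t that
    minimize ||R @ src[i] + t - dst[i]|| over corresponding 3D points.
    src, dst: (N, 3) arrays of matched keypoints from two frames."""
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # Cross-covariance of the centred point sets
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (det = -1) in the recovered rotation
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

Chaining such pairwise transforms across the sequence yields the per-frame camera poses that a multi-view reconstruction method then consumes; in practice the estimates would be refined (e.g., by bundle adjustment), since noisy keypoints under heavy occlusion degrade the registration, as noted above.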
