Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes

We present Implicit Two Hands (Im2Hands), the first neural implicit representation of two interacting hands. Unlike existing methods for two-hand reconstruction that rely on a parametric hand model and/or low-resolution meshes, Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency. To handle the shape complexity and interaction context between two hands, Im2Hands models the occupancy volume of two hands, conditioned on an RGB image and coarse 3D keypoints, using two novel attention-based modules responsible for (1) initial occupancy estimation and (2) context-aware occupancy refinement. Im2Hands first learns per-hand neural articulated occupancy in the canonical space designed for each hand using query-image attention. It then refines the initial two-hand occupancy in the posed space to enhance the coherency between the two hand shapes using query-anchor attention. In addition, we introduce an optional keypoint refinement module to enable robust two-hand shape estimation from predicted hand keypoints in a single-image reconstruction scenario. We experimentally demonstrate the effectiveness of Im2Hands on two-hand reconstruction in comparison to related methods, where it achieves state-of-the-art results. Our code is publicly available at https://github.com/jyunlee/Im2Hands.
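
The two-stage idea (initial occupancy from query-image attention, then refinement from query-anchor attention) can be illustrated with a toy sketch. This is not the authors' implementation: random vectors stand in for learned query, image, and anchor features, the canonical-space and posed-space transforms are omitted, the refinement is modelled as a residual on the initial logit purely for illustration, and every name here (`cross_attention`, `w_init`, `w_refine`, the feature sizes) is hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def cross_attention(query, context):
    """Scaled dot-product attention of one query vector over context vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in context])
    attended = [
        sum(w * c[i] for w, c in zip(weights, context))
        for i in range(len(query))
    ]
    return attended, weights


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8
query_feats = random_vectors(rng, 4, DIM)    # stand-ins for embedded 3D query points
image_feats = random_vectors(rng, 16, DIM)   # stand-ins for image patch features
anchor_feats = random_vectors(rng, 6, DIM)   # stand-ins for anchor features of both hands
w_init = [rng.gauss(0.0, 1.0) for _ in range(DIM)]    # hypothetical linear occupancy head
w_refine = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # hypothetical linear refinement head

initial_occ, refined_occ = [], []
for q in query_feats:
    # Stage 1: the query attends to image features to get an initial occupancy logit.
    img_ctx, _ = cross_attention(q, image_feats)
    init_logit = dot([a + b for a, b in zip(q, img_ctx)], w_init)
    # Stage 2: the query attends to anchors and a residual adjusts the initial logit.
    anchor_ctx, _ = cross_attention(q, anchor_feats)
    refined_logit = init_logit + dot(anchor_ctx, w_refine)
    initial_occ.append(sigmoid(init_logit))
    refined_occ.append(sigmoid(refined_logit))

print(initial_occ)
print(refined_occ)
```

Each query point ends up with an initial and a refined occupancy probability in [0, 1]. In a trained model the features and heads would be learned, and a mesh would typically be extracted from the occupancy field afterwards.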
