HandsFormer: Keypoint Transformer for Monocular 3D Pose Estimation ofHands and Object in Interaction

We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints may happen. Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap. We do not require that all locations correctly correspond to a joint, not that all the joints are detected. We use appearance and spatial encodings of these locations as input to a transformer, and leverage the attention mechanisms to sort out the correct configuration of the joints and output the 3D poses of both hands. Our approach thus allies the recognition power of a Transformer to the accuracy of heatmap-based methods. We also show it can be extended to estimate the 3D pose of an object manipulated by one or two hands. We evaluate our approach on the recent and challenging InterHand2.6M and HO-3D datasets. We obtain 17% improvement over the baseline. Moreover, we introduce the first dataset made of action sequences of two hands manipulating an object fully annotated in 3D and will make it publicly available.

[1]  Yunhui Liu,et al.  Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction , 2020, ECCV.

[2]  Jiayi Wang,et al.  RGB2Hands , 2020, ACM Trans. Graph..

[3]  Yaser Sheikh,et al.  Constraining dense hand surface tracking with elasticity , 2020, ACM Trans. Graph..

[4]  Xiaowei Zhou,et al.  Learning to Estimate 3D Human Pose and Shape from a Single Color Image , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Timothy Patten,et al.  Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[7]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Jitendra Malik,et al.  End-to-End Recovery of Human Shape and Pose , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Marc Pollefeys,et al.  Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation , 2015, International Journal of Computer Vision.

[10]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  David Kim,et al.  Articulated distance fields for ultra-fast tracking of hands interacting , 2017, ACM Trans. Graph..

[12]  Yan Zhang,et al.  Grasping Field: Learning Implicit Representations for Human Grasps , 2020, 2020 International Conference on 3D Vision (3DV).

[13]  Kyoung Mu Lee,et al.  I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image , 2020, ECCV.

[14]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Tae-Kyun Kim,et al.  Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Antonis A. Argyros,et al.  Tracking the articulated motion of two strongly interacting hands , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Yana Hasson,et al.  Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Charles C. Kemp,et al.  ContactPose: A Dataset of Grasps with Object Contact and Hand Pose , 2020, ECCV.

[20]  Nicolas Usunier,et al.  End-to-End Object Detection with Transformers , 2020, ECCV.

[21]  Antonis A. Argyros,et al.  Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[22]  Chengde Wan,et al.  MEgATrack , 2020, ACM Trans. Graph..

[23]  Mathieu Aubry,et al.  CosyPose: Consistent multi-view multi-object 6D pose estimation , 2020, ECCV.

[24]  Ersin Yumer,et al.  Self-supervised Learning of Motion Capture , 2017, NIPS.

[25]  Luc Van Gool,et al.  Motion Capture of Hands in Action Using Discriminative Salient Points , 2012, ECCV.

[26]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[27]  Antonis A. Argyros,et al.  Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints , 2011, 2011 International Conference on Computer Vision.

[28]  J. Kautz,et al.  Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints , 2020, ECCV.

[29]  Ilija Radosavovic,et al.  Reconstructing Hand-Object Interactions in the Wild , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Shanxin Yuan,et al.  First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Vincent Lepetit,et al.  HOnnotate: A Method for 3D Annotation of Hand and Object Poses , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[33]  Dieter Fox,et al.  PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes , 2017, Robotics: Science and Systems.

[34]  Georg Heigold,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.

[35]  Dimitrios Tzionas,et al.  GRAB: A Dataset of Whole-Body Human Grasping of Objects , 2020, ECCV.

[36]  Kyoung Mu Lee,et al.  Camera Distance-Aware Top-Down Approach for 3D Multi-Person Pose Estimation From a Single RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Baining Guo,et al.  Learning Texture Transformer Network for Image Super-Resolution , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Antonis A. Argyros,et al.  Scalable 3D Tracking of Multiple Interacting Objects , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[40]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Fahad Shahbaz Khan,et al.  Transformers in Vision: A Survey , 2021, ACM Comput. Surv..

[42]  Antti Oulasvirta,et al.  Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input , 2016, ECCV.

[43]  Lijuan Wang,et al.  End-to-End Human Pose and Mesh Reconstruction with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Thomas Brox,et al.  U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[45]  Antonis A. Argyros,et al.  3D Tracking of Human Hands in Interaction with Unknown Objects , 2015, BMVC.

[46]  Vincent Lepetit,et al.  Training a Feedback Loop for Hand Pose Estimation , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47]  Nal Kalchbrenner,et al.  Colorization Transformer , 2021, ICLR.

[48]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[49]  Yi Zhou,et al.  On the Continuity of Rotation Representations in Neural Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Dimitrios Tzionas,et al.  3D Object Reconstruction from Hand-Object Interactions , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  M. Pollefeys,et al.  Unified Egocentric Recognition of 3 D Hand-Object Poses and Interactions , 2019 .

[52]  Yaser Sheikh,et al.  Monocular Total Capture: Posing Face, Body, and Hands in the Wild , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Bin Li,et al.  Deformable DETR: Deformable Transformers for End-to-End Object Detection , 2020, ICLR.

[55]  Andrew W. Fitzgibbon,et al.  Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences , 2016, ACM Trans. Graph..

[56]  Marc Pollefeys,et al.  H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Eric Brachmann,et al.  BOP Challenge 2020 on 6D Object Localization , 2020, ECCV Workshops.

[58]  A OtaduyMiguel,et al.  Real-time pose and shape reconstruction of two interacting hands with a single depth camera , 2019 .

[59]  Takaaki Shiratori,et al.  InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image , 2020, ECCV.

[60]  Kostas Daniilidis,et al.  TexturePose: Supervising Human Mesh Estimation With Texture Consistency , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).