Exploiting Learnable Joint Groups for Hand Pose Estimation

In this paper, we propose to estimate 3D hand pose by recovering the 3D coordinates of joints in a group-wise manner, where less-related joints are automatically categorized into different groups and exhibit different features. This is different from the previous methods where all the joints are considered holistically and share the same feature. The benefits of our method are illustrated by the principle of multi-task learning (MTL), i.e., by separating less-related joints into different groups (as different tasks), our method learns different features for each of them, therefore efficiently avoids the negative transfer (among less related tasks/groups of joints). The key of our method is a novel binary selector that automatically selects related joints into the same group. We implement such a selector with binary values stochastically sampled from a Concrete distribution, which is constructed using Gumbel softmax on trainable parameters. This enables us to preserve the differentiable property of the whole network. We further exploit features from those less-related groups by carrying out an additional feature fusing scheme among them, to learn more discriminative features. This is realized by implementing multiple 1x1 convolutions on the concatenated features, where each joint group contains a unique 1x1 convolution for feature fusion. The detailed ablation analysis and the extensive experiments on several benchmark datasets demonstrate the promising performance of the proposed method over the state-of-the-art (SOTA) methods. Besides, our method achieves top-1 among all the methods that do not exploit the dense 3D shape labels on the most recently released FreiHAND competition at the submission date. The source code and models are available at this https URL moranli-aca/LearnableGroups-Hand.

[1]  Tae-Kyun Kim,et al.  Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Kyoung Mu Lee,et al.  V2V-PoseNet: Voxel-to-Voxel Prediction Network for Accurate 3D Hand and Human Pose Estimation from a Single Depth Map , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Dimitrios Tzionas,et al.  Embodied Hands: Modeling and Capturing Hands and Bodies Together , 2022, ArXiv.

[4]  Ying Wu,et al.  Does Learning Specific Features for Related Parts Help Human Pose Estimation? , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Angela Yao,et al.  Aligning Latent Spaces for 3D Hand Pose Estimation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[6]  Mingliang Chen,et al.  A hand pose tracking benchmark from stereo matching , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[7]  Didier Stricker,et al.  DeepHPS: End-to-end Estimation of 3D Hand Pose and Shape by Learning from Synthetic Depth , 2018, 2018 International Conference on 3D Vision (3DV).

[8]  Duc Thanh Nguyen,et al.  JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds With Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Thomas Brox,et al.  FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[10]  Junsong Yuan,et al.  Hand PointNet: 3D Hand Pose Estimation Using Point Sets , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[11]  Angela Yao,et al.  Disentangling Latent Hands for Image Synthesis and Pose Estimation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Bo Liu,et al.  Constrained-size Tensorflow Models for YouTube-8M Video Understanding Challenge , 2018, ECCV Workshops.

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Qiang Li,et al.  End-to-End Hand Mesh Recovery From a Monocular RGB Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[15]  Philip H. S. Torr,et al.  3D Hand Shape and Pose From Images in the Wild , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Thomas Brox,et al.  Learning to Estimate 3D Hand Pose from Single RGB Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Jia Deng,et al.  Stacked Hourglass Networks for Human Pose Estimation , 2016, ECCV.

[18]  Yichen Wei,et al.  Integral Human Pose Regression , 2017, ECCV.

[19]  Jianfei Cai,et al.  3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material) , 2019 .

[20]  Wei Liu,et al.  MTL-NAS: Task-Agnostic Neural Architecture Search Towards General-Purpose Multi-Task Learning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Sergio Escalera,et al.  End-to-end Global to Local CNN Learning for Hand Pose Recovery in Depth data , 2017, ArXiv.

[22]  J. Kautz,et al.  Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints , 2020, ECCV.

[23]  Antti Oulasvirta,et al.  Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input , 2016, ECCV.

[24]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[25]  Yi Sun,et al.  CrossInfoNet: Multi-Task Information Sharing Based Hand Pose Estimation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Andrew J. Davison,et al.  End-To-End Multi-Task Learning With Attention , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Guijin Wang,et al.  Pose Guided Structured Region Ensemble Network for Cascaded Hand Pose Estimation , 2017, Neurocomputing.

[28]  Nadia Magnenat-Thalmann,et al.  Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[29]  Ehsan Jahangiri,et al.  Generating Multiple Diverse Hypotheses for Human 3D Pose Consistent with 2D Joint Detections , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[30]  Christian Theobalt,et al.  GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Petros Daras,et al.  Cross-modal Variational Alignment of Latent Spaces , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[32]  Otmar Hilliges,et al.  Cross-Modal Deep Variational Hand Pose Estimation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Zhao Chen,et al.  GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks , 2017, ICML.

[34]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[35]  David J. Crandall,et al.  HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  P. Schönemann,et al.  A generalized solution of the orthogonal procrustes problem , 1966 .

[37]  Wei Liu,et al.  NDDR-CNN: Layerwise Feature Fusing in Multi-Task CNNs by Neural Discriminative Dimensionality Reduction , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Cordelia Schmid,et al.  Learning Joint Reconstruction of Hands and Manipulated Objects , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Wonhee Lee,et al.  Multi-Task Self-Supervised Object Detection via Recycling of Bounding Box Annotations , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Jianfei Cai,et al.  Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images , 2018, ECCV.

[41]  Yi Sun,et al.  HBE: Hand Branch Ensemble Network for Real-Time 3D Hand Pose Estimation , 2018, ECCV.

[42]  Pavlo Molchanov,et al.  Hand Pose Estimation via Latent 2.5D Heatmap Regression , 2018, ECCV.

[43]  Iasonas Kokkinos,et al.  Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Luc Van Gool,et al.  Dense 3D Regression for Hand Pose Estimation , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Junsong Yuan,et al.  Point-to-Point Regression PointNet for 3D Hand Pose Estimation , 2018, ECCV.

[46]  Antonis A. Argyros,et al.  Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild , 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[47]  Martial Hebert,et al.  Cross-Stitch Networks for Multi-task Learning , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Bin Yang,et al.  Multi-Task Multi-Sensor Fusion for 3D Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Tae-Kyun Kim,et al.  Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Yang Xiao,et al.  A2J: Anchor-to-Joint Regression Network for 3D Articulated Pose Estimation From a Single Depth Image , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).