Paper Information - Hand Image Understanding via Deep Multi-Task Learning

Hand Image Understanding via Deep Multi-Task Learning

Stem Module. The stem module consists of two 7×7 convolutional layers with stride 2, with 64 and 128 output channels, respectively.

Encoder. We employ the main body of ResNet-50 [5] to implement the encoder. Specifically, the initial conv1 layer and the prediction head are removed, while the remaining conv2_x, conv3_x, conv4_x, and conv5_x stages are adopted to build the encoder module, with 3, 4, 5, and 6 repetitions, respectively.

Heat-Map Decoder. The heat-map decoder estimates the feature maps hms_ft ∈ ℝ^{256×64×64}, which capture semantic features that encode the 2D hand pose. Similar to [10], the decoder also estimates the heat-maps hms ∈ ℝ^{21×128×128} from hms_ft to represent the locations of the 21 hand keypoints, and hms is used for intermediate supervision. Skip connections between the encoder and the heat-map decoder are also adopted to facilitate learning.

Mask Decoder. The mask decoder estimates the feature maps mask_ft ∈ ℝ^{256×64×64}, which capture semantic features that encode the hand segmentation mask. Similar to the heat-map decoder, the mask decoder also estimates the hand segmentation mask ∈ ℝ^{1×256×256} from mask_ft, and the mask is used for intermediate supervision. Skip connections between the encoder and the mask decoder are also adopted to facilitate learning.

POF Decoder. The POF decoder estimates the feature maps pof_ft ∈ ℝ^{256×64×64}, which capture semantic features that carry the 3D POF encoding. Similar to the
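The spatial sizes quoted above can be checked with a short calculation. The sketch below is not the authors' code: it applies the standard convolution output-size formula to the two stem layers, and it assumes a padding of 3 and a 256×256 input (the text states only kernel size, stride, and channel counts; the input size is inferred from the 256×256 mask). The names `conv_out_size`, `stem_output_shape`, and `DECODER_SHAPES` are hypothetical.

```python
from typing import Dict, Tuple


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output_shape(height: int, width: int) -> Tuple[int, int, int]:
    """Shape (C, H, W) after two 7x7, stride-2 convolutions (64, then 128 channels).

    padding=3 is an assumption; it is not given in the text.
    """
    out_channels = 0
    for out_channels in (64, 128):
        height = conv_out_size(height, kernel=7, stride=2, padding=3)
        width = conv_out_size(width, kernel=7, stride=2, padding=3)
    return out_channels, height, width


# Decoder tensor shapes (C, H, W) exactly as stated in the text.
DECODER_SHAPES: Dict[str, Tuple[int, int, int]] = {
    "hms_ft": (256, 64, 64),
    "hms": (21, 128, 128),
    "mask_ft": (256, 64, 64),
    "mask": (1, 256, 256),
    "pof_ft": (256, 64, 64),
}

if __name__ == "__main__":
    # Assumed 256x256 input image.
    print(stem_output_shape(256, 256))  # (128, 64, 64)
```

Under these assumptions the stem reduces a 256×256 image to 64×64, which matches the spatial resolution of the three decoder feature maps listed in `DECODER_SHAPES`.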

[1] Cordelia Schmid, et al. LCR-Net: Localization-Classification-Regression for Human Pose, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Thomas Brox, et al. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape From Single RGB Images, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3] Antti Oulasvirta, et al. Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input, 2016, ECCV.

[4] Angela Yao, et al. Aligning Latent Spaces for 3D Hand Pose Estimation, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[5] Varun Ramakrishna, et al. Convolutional Pose Machines, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Daniel Thalmann, et al. Robust 3D Hand Pose Estimation in Single Depth Images: From Single-View CNN to Multi-View CNNs, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Andrew W. Fitzgibbon, et al. Learning an efficient model of hand shape variation from depth images, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Chen Qian, et al. Realtime and Robust Hand Tracking from Depth, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[9] Thomas Brox, et al. Learning to Estimate 3D Hand Pose from Single RGB Images, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10] Dongheui Lee, et al. Point-To-Pose Voting Based Hand Pose Estimation Using Residual Permutation Equivariant Layer, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Sylvain Paris, et al. 6D hands: markerless hand-tracking for computer aided design, 2011, UIST.

[12] Ross B. Girshick, et al. Fast R-CNN, 2015, 1504.08083.

[13] Dimitrios Tzionas, et al. Embodied Hands: Modeling and Capturing Hands and Bodies Together, 2022, ArXiv.

[14] Christian Theobalt, et al. Monocular Real-Time Hand Shape and Motion Capture Using Multi-Modal Data, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Tae-Kyun Kim, et al. Opening the Black Box: Hierarchical Sampling Optimization for Estimating Human Hand Pose, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16] Jian Yang, et al. Selective Kernel Networks, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Kyoung Mu Lee, et al. I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image, 2020, ECCV.

[18] Hyung Jin Chang, et al. SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation, 2020, ECCV.

[19] Cordelia Schmid, et al. Learning Joint Reconstruction of Hands and Manipulated Objects, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Antonis A. Argyros, et al. Efficient model-based 3D tracking of hand articulations using Kinect, 2011, BMVC.

[21] Yi Sun, et al. CrossInfoNet: Multi-Task Information Sharing Based Hand Pose Estimation, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Wan-Yen Lo, et al. Accelerating 3D deep learning with PyTorch3D, 2019, SIGGRAPH Asia 2020 Courses.

[23] Jianfei Cai, et al. Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images, 2018, ECCV.

[24] Hui Cheng, et al. Recurrent 3D Pose Sequence Machines, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Antti Oulasvirta, et al. Investigating the Dexterity of Multi-Finger Input for Mid-Air Text Entry, 2015, CHI.

[26] Luc Van Gool, et al. Motion Capture of Hands in Action Using Discriminative Salient Points, 2012, ECCV.

[27] Paul L. Rosin, et al. Pose2Seg: Detection Free Human Instance Segmentation, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Jason Weston, et al. A unified architecture for natural language processing: deep neural networks with multitask learning, 2008, ICML '08.

[29] Woontack Woo, et al. 3D Finger CAPE: Clicking Action and Position Estimation under Self-Occlusions in Egocentric Viewpoint, 2015, IEEE Transactions on Visualization and Computer Graphics.

[30] Pavlo Molchanov, et al. Hand Pose Estimation via Latent 2.5D Heatmap Regression, 2018, ECCV.

[31] Jian Sun, et al. Cascaded hand pose regression, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Alex Pentland, et al. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[33] Qiang Li, et al. End-to-End Hand Mesh Recovery From a Monocular RGB Image, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34] Mingliang Chen, et al. A hand pose tracking benchmark from stereo matching, 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[35] Lourdes Agapito, et al. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Wolfgang Hürst, et al. Gesture-based interaction via finger tracking for mobile augmented reality, 2011, Multimedia Tools and Applications.

[37] Iasonas Kokkinos, et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[39] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Antonis A. Argyros, et al. Using a Single RGB Frame for Real Time 3D Hand Pose Estimation in the Wild, 2017, 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).

[41] Antti Oulasvirta, et al. Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data, 2013, 2013 IEEE International Conference on Computer Vision.

[42] Tae-Kyun Kim, et al. Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43] Nadia Magnenat-Thalmann, et al. Exploiting Spatial-Temporal Relationships for 3D Pose Estimation via Graph Convolutional Networks, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[44] Tae-Kyun Kim, et al. Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Ji Liu, et al. HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation, 2020, ACM Multimedia.

[46] Iasonas Kokkinos, et al. Weakly-Supervised Mesh-Convolutional Hand Reconstruction in the Wild, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[47] Kasper Hornbæk, et al. Vulture: a mid-air word-gesture keyboard, 2014, CHI.

[48] Serge J. Belongie, et al. Pose2Instance: Harnessing Keypoints for Person Instance Segmentation, 2017, ArXiv.

[49] Luc Van Gool, et al. Dense 3D Regression for Hand Pose Estimation, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50] Roberto Cipolla, et al. Fast-SCNN: Fast Semantic Segmentation Network, 2019, BMVC.

[51] Michael J. Black, et al. SMPL: A Skinned Multi-Person Linear Model, 2023.

[52] Jianfei Cai, et al. 3D Hand Shape and Pose Estimation from a Single RGB Image (Supplementary Material), 2019.

[53] J. Kautz, et al. Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints, 2020, ECCV.

[54] Haoyu Ma, et al. Nonparametric Structure Regularization Machine for 2D Hand Pose Estimation, 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[55] Ming Ouhyoung, et al. A real-time continuous gesture recognition system for sign language, 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[56] Yaser Sheikh, et al. Monocular Total Capture: Posing Face, Body, and Hands in the Wild, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Philip H. S. Torr, et al. 3D Hand Shape and Pose From Images in the Wild, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Yaser Sheikh, et al. Hand Keypoint Detection in Single Images Using Multiview Bootstrapping, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Sergio Escalera, et al. Depth-Based 3D Hand Pose Estimation: From Current Achievements to Future Goals, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Sergio Orts, et al. Large-scale Multiview 3D Hand Pose Dataset, 2017, Image Vis. Comput.

[61] Jia Deng, et al. Stacked Hourglass Networks for Human Pose Estimation, 2016, ECCV.

[62] Ross B. Girshick, et al. Mask R-CNN, 2017, 1703.06870.

[63] Christian Theobalt, et al. HandVoxNet: Deep Voxel-Based Network for 3D Hand Shape and Pose Estimation From a Single Depth Map, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[64] Angela Yao, et al. Disentangling Latent Hands for Image Synthesis and Pose Estimation, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[65] Cristian Sminchisescu, et al. Deep Multitask Architecture for Integrated 2D and 3D Human Sensing, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[66] Jihun Yu, et al. HUMBI: A Large Multiview Dataset of Human Body Expressions, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[67] Yao Wang, et al. Adaptive Computationally Efficient Network for Monocular 3D Hand Pose Estimation, 2020, ECCV.

[68] Li Cheng, et al. Efficient Hand Pose Estimation from a Single Depth Image, 2013, 2013 IEEE International Conference on Computer Vision.

[69] Christian Theobalt, et al. GANerated Hands for Real-Time 3D Hand Tracking from Monocular RGB, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[70] Li Liu, et al. JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image, 2020, ECCV.

[71] Petros Daras, et al. Cross-modal Variational Alignment of Latent Spaces, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[72] Yang Zhao, et al. Deep High-Resolution Representation Learning for Visual Recognition, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[73] Otmar Hilliges, et al. Cross-Modal Deep Variational Hand Pose Estimation, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[74] Dimitris N. Metaxas, et al. Knowledge As Priors: Cross-Modal Knowledge Generalization for Datasets Without Superior Knowledge, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).