Paper information - K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation

K-Order Graph-oriented Transformer with GraAttention for 3D Pose and Shape Estimation

We propose a novel attention-based 2D-to-3D pose estimation network for graph-structured data, named KOG-Transformer, and a 3D pose-to-shape estimation network for hand data, named GASE-Net. Previous 3D pose estimation methods focus on various modifications of the graph convolutional kernel, such as abandoning weight sharing or increasing the receptive field. Some of these methods employ attention-based non-local modules as auxiliary modules. To better model the relationships between nodes in graph-structured data and to fuse the information of different neighbor nodes in a differentiated way, we make targeted modifications to the attention module and propose two modules designed for graph-structured data: graph relative positional encoding multi-head self-attention (GR-MSA) and K-order graph-oriented multi-head self-attention (KOG-MSA). By stacking GR-MSA and KOG-MSA, we build a novel network, KOG-Transformer, for 2D-to-3D pose estimation. Further, we propose a network for shape estimation on hand data, called GraAttention Shape Estimation Network (GASE-Net), which takes a 3D pose as input and gradually models the shape of a hand from sparse to dense. Extensive experiments show that KOG-Transformer significantly outperforms previous state-of-the-art methods on the benchmark dataset Human3.6M. We evaluate GASE-Net on two publicly available hand datasets, ObMan and InterHand2.6M. The experimental results show that GASE-Net can estimate the corresponding shapes for input poses with strong generalization ability.
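The abstract does not define what "K-order" means. The following is a minimal sketch of one possible reading, not the authors' KOG-MSA: self-attention over skeleton joints in which each joint may attend only to joints at most K hops away in the skeleton graph. The function names `hop_distance` and `k_order_attention`, the 5-joint chain, and the absence of learned projections and multiple heads are all assumptions made for illustration.

```python
import numpy as np
from collections import deque


def hop_distance(num_nodes, edges):
    """All-pairs hop counts on an undirected graph, computed with BFS."""
    neighbors = [[] for _ in range(num_nodes)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    dist = np.full((num_nodes, num_nodes), -1, dtype=int)  # -1 = unreachable
    for src in range(num_nodes):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[src, v] < 0:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def k_order_attention(x, dist, max_order):
    """Single-head dot-product attention restricted to nodes within max_order hops.

    Illustrative assumption: queries, keys, and values are all x itself
    (no learned projections), so only the graph masking is shown.
    """
    scores = x @ x.T / np.sqrt(x.shape[1])
    allowed = (dist >= 0) & (dist <= max_order)
    scores = np.where(allowed, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights


# Hypothetical 5-joint chain 0-1-2-3-4 (not a real skeleton layout).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
dist = hop_distance(5, edges)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))  # 5 joints, 8-dimensional features
out, weights = k_order_attention(x, dist, max_order=2)
```

With `max_order=2`, joint 0 receives zero attention weight from joints 3 and 4, which are three and four hops away. A model that treats neighbor orders differently, as the abstract describes, would presumably go beyond this hard mask, for example by learning separate parameters per hop distance. That part is not shown here.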

Weiqiang Wang | Weixi Zhao
