Learning Skeletal Graph Neural Networks for Hard 3D Pose Estimation

Various deep learning techniques have been proposed to solve the single-view 2D-to-3D pose estimation problem. While average prediction accuracy has improved significantly over the years, performance on hard poses, namely those with depth ambiguity, self-occlusion, or complex or rare configurations, is still far from satisfactory. In this work, we target these hard poses and present a novel skeletal GNN learning solution. Specifically, we propose a hop-aware hierarchical channel-squeezing fusion layer that effectively extracts relevant information from neighboring nodes while suppressing undesired noise in GNN learning. In addition, we propose a temporal-aware dynamic graph construction procedure that is robust and effective for 3D pose estimation. Experimental results on the Human3.6M dataset show that our solution improves average prediction accuracy by 10.3% and greatly improves results on hard poses compared with state-of-the-art techniques. We further apply the proposed technique to the skeleton-based action recognition task and also achieve state-of-the-art performance. Our code is available at https://github.com/ailingzengzzz/Skeletal-GNN.
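To make the idea of hop-aware channel squeezing more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "hop-aware" means grouping neighbors by their exact hop distance on the skeleton graph, and that "channel squeezing" means giving more distant hops fewer feature channels. The function names, the halving schedule (`decay=0.5`), the mean pooling, and the channel truncation that stands in for a learned projection are all hypothetical choices made for illustration; the toy 5-joint chain is not a real skeleton.

```python
from collections import deque


def hop_distances(num_nodes, edges):
    """All-pairs hop distance on an undirected graph, via BFS from every node."""
    adj = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[source][v] < 0:  # not visited yet
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


def squeezed_channels(base_channels, max_hop, decay=0.5, min_channels=1):
    """Hypothetical schedule: hop k keeps base_channels * decay**(k - 1) channels."""
    return {
        k: max(min_channels, int(base_channels * decay ** (k - 1)))
        for k in range(1, max_hop + 1)
    }


def hop_aggregate(features, dist, widths):
    """Mean-pool the nodes at exactly hop k, keep the first widths[k] channels
    (a stand-in for a learned projection), and concatenate over hops."""
    num_nodes = len(features)
    fused = []
    for i in range(num_nodes):
        row = []
        for k, width in sorted(widths.items()):
            members = [j for j in range(num_nodes) if dist[i][j] == k]
            if members:
                pooled = [
                    sum(features[j][c] for j in members) / len(members)
                    for c in range(width)
                ]
            else:
                pooled = [0.0] * width  # no node at this hop distance
            row.extend(pooled)
        fused.append(row)
    return fused


# Toy skeleton: a chain of 5 joints, 0-1-2-3-4, with 8 channels per joint.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
features = [[float(i)] * 8 for i in range(5)]

dist = hop_distances(5, edges)
widths = squeezed_channels(base_channels=8, max_hop=3)
fused = hop_aggregate(features, dist, widths)
print(widths, len(fused[0]))
```

In this sketch, immediate neighbors contribute all 8 channels while 2-hop and 3-hop neighbors contribute 4 and 2, so distant joints can still inform a prediction without dominating the fused feature. A trained model would replace the truncation with learned per-hop weight matrices.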
