Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition

One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized, and their low efficiency in training and inference has obstructed progress in the field, especially on large-scale action datasets. In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), which combines three main improvements: early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with a bottleneck structure, and a Part-wise Attention (PartAtt) block. First, an MIB is designed to enrich informative skeleton features while retaining compact representations at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Networks (CNNs), a ResGCN module is introduced into the GCN to reduce computational costs and ease learning during model training while maintaining model accuracy. Finally, a PartAtt block is proposed to discover the most essential body parts over a whole action sequence and obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models while requiring far fewer parameters for training and inference, e.g., up to 34 times fewer than DGNN, which is one of the best SOTA methods.
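The parameter savings that a bottleneck structure can bring are easy to illustrate with a back-of-the-envelope count. The sketch below is not the paper's architecture: it follows the generic ResNet-style bottleneck (1x1 reduce, narrow main convolution, 1x1 expand), and the channel width of 256, kernel size of 9, and reduction ratio of 4 are assumed values chosen only for illustration. The function names are hypothetical, and biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, kernel=1):
    """Weight count of a convolution with `kernel` taps per channel pair (bias ignored)."""
    return c_in * c_out * kernel


def basic_block_params(channels, kernel):
    # A single full-width convolution: channels -> channels.
    return conv_params(channels, channels, kernel)


def bottleneck_block_params(channels, kernel, reduction=4):
    # Assumed ResNet-style layout; the reduction ratio is an illustrative choice.
    inner = channels // reduction
    return (
        conv_params(channels, inner, 1)        # 1x1 convolution reduces the width
        + conv_params(inner, inner, kernel)    # main convolution at reduced width
        + conv_params(inner, channels, 1)      # 1x1 convolution restores the width
    )


basic = basic_block_params(256, 9)
bottleneck = bottleneck_block_params(256, 9)
print(f"basic: {basic}, bottleneck: {bottleneck}, ratio: {basic / bottleneck:.2f}")
```

Under these assumed sizes the bottleneck variant needs roughly an eighth of the weights of the full-width block. The figures reported in the abstract come from the complete model and cannot be derived from this toy count.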
