Cross-Channel Graph Convolutional Networks for Skeleton-Based Action Recognition

In recent years, skeleton-based action recognition, graph convolutional networks, have achieved remarkable performance. In these existing works, the features of all nodes in the neighbor set are aggregated into the updated features of the root node, while these features are located in the same feature channel determined by the same $1\times 1$ convolution filter. This may not be optimal for capturing the features of spatial dimensions among adjacent vertices effectively. Besides, the effect of feature channels that are independent of the current action on the performance of the model is rarely investigated in existing methods. In this paper, we propose cross-channel graph convolutional networks for skeleton-based action recognition. The features fusion mechanism in our network is cross-channel, i.e, the updated feature of the root node is derived from different feature channels. Because different feature channels come from different $1 \times 1$ convolution filters, the cross-channel fusion mechanism significantly improves the ability of the model to capture local features among adjacent vertices. Moreover, by introducing a channel attention mechanism to our model, we suppress the influence of feature channels unrelated to action recognition on model performance, which improves the robustness of the model against the feature channels independent of the current action. Extensive experiments on the two large-scale datasets, NTU-RGB+D and Kinetics-Skeleton, demonstrate that the performance of our model exceeds the current mainstream methods.

[1]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[2]  Joan Bruna,et al.  Spectral Networks and Locally Connected Networks on Graphs , 2013, ICLR.

[3]  Jure Leskovec,et al.  Inductive Representation Learning on Large Graphs , 2017, NIPS.

[4]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[5]  Bing Li,et al.  Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition , 2016, ECCV.

[6]  Lei Shi,et al.  Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[8]  Nanning Zheng,et al.  View Adaptive Recurrent Neural Networks for High Performance Human Action Recognition from Skeleton Data , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Shuai Li,et al.  Independently Recurrent Neural Network (IndRNN): Building A Longer and Deeper RNN , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Chao Li,et al.  Skeleton-based action recognition with convolutional neural networks , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[12]  Anoop Cherian,et al.  Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons , 2016, ECCV.

[13]  Hong Liu,et al.  Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition , 2017, ArXiv.

[14]  Junsong Yuan,et al.  Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16]  Chao Li,et al.  Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation , 2018, IJCAI.

[17]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[18]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[19]  Mathias Niepert,et al.  Learning Convolutional Neural Networks for Graphs , 2016, ICML.

[20]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[21]  Joan Bruna,et al.  Deep Convolutional Networks on Graph-Structured Data , 2015, ArXiv.

[22]  Yixin Chen,et al.  An End-to-End Deep Learning Architecture for Graph Classification , 2018, AAAI.

[23]  Xavier Bresson,et al.  Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering , 2016, NIPS.

[24]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Hanqing Lu,et al.  Skeleton-Based Action Recognition With Gated Convolutional Neural Networks , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[26]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[27]  Eli Shechtman,et al.  In defense of Nearest-Neighbor based image classification , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[29]  Jonathan Masci,et al.  Geometric Deep Learning on Graphs and Manifolds Using Mixture Model CNNs , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Donald F. Towsley,et al.  Diffusion-Convolutional Neural Networks , 2015, NIPS.

[31]  Alán Aspuru-Guzik,et al.  Convolutional Networks on Graphs for Learning Molecular Fingerprints , 2015, NIPS.

[32]  R. Zemel,et al.  Neural Relational Inference for Interacting Systems , 2018, ICML.

[33]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[34]  Yi Lin,et al.  Skeleton based action recognition using translation-scale invariant image mapping and multi-scale deep CNN , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[35]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[36]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Lin Li,et al.  Skeleton-Based Relational Modeling for Action Recognition , 2018, ArXiv.

[38]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Wei Zhang,et al.  Dynamic Graph Representation Learning via Self-Attention Networks , 2018, ArXiv.

[40]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Qilong Wang,et al.  ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).