Human action recognition model incorporating multiscale temporal convolutional network and spatiotemporal excitation network

Abstract. Human action recognition is a research hotspot in computer vision. Focusing on the problem of recognizing similar actions, we propose an improved two-stream adaptive graph convolutional network for skeleton-based action recognition that incorporates a multiscale temporal convolutional network and a spatiotemporal excitation network. The multiscale temporal convolutional network extracts temporal information with dilated convolutions at different scales, which broadens the width of the temporal network and at the same time captures more temporal features that differ only slightly between categories. The spatiotemporal excitation network pools the input features across channels into single-channel features and applies a two-dimensional convolution to them, which excites important spatiotemporal information and strengthens the role of local nodes in similar actions. We conducted extensive tests and ablation studies on three large-scale datasets: NTU-RGB+D60, NTU-RGB+D120, and Kinetics-Skeleton. On average, our model outperforms the baseline by 7.2% and the state-of-the-art model by 4% on similar action recognition on the NTU-RGB+D60 dataset, which demonstrates the superiority of our model.
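The two modules named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tensor layout (channels, frames, joints), the fixed averaging kernel, the dilation rates, the concatenation of branches, the sigmoid gate, and the residual connection are all assumptions made for illustration, and the function names are hypothetical. In a trained model the kernels would be learned.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """Dilated 1D convolution (cross-correlation) along the time axis.

    x: array of shape (C, T, V) = (channels, frames, joints), an assumed layout.
    kernel: odd-length 1D array shared across channels and joints.
    Zero padding keeps the number of frames unchanged.
    """
    _, T, _ = x.shape
    K = kernel.shape[0]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for k in range(K):
        start = k * dilation
        out += kernel[k] * xp[:, start:start + T, :]
    return out

def multiscale_temporal_conv(x, dilations=(1, 2, 3), kernel=None):
    """Run one dilated branch per rate and concatenate along channels,
    widening the temporal block (assumed fusion strategy)."""
    if kernel is None:
        kernel = np.ones(3) / 3.0  # placeholder for learned weights
    branches = [dilated_temporal_conv(x, kernel, d) for d in dilations]
    return np.concatenate(branches, axis=0)

def spatiotemporal_excitation(x, kernel2d):
    """Channel pooling -> single-channel 2D convolution -> sigmoid gate.

    kernel2d: odd-sized 2D array applied over the (frames, joints) plane.
    The gated features are added back to the input (assumed residual form).
    """
    pooled = x.mean(axis=0)                      # (T, V), single channel
    T, V = pooled.shape
    kh, kw = kernel2d.shape
    ph, pw = kh // 2, kw // 2
    pp = np.pad(pooled, ((ph, ph), (pw, pw)))
    response = np.zeros((T, V))
    for i in range(kh):
        for j in range(kw):
            response += kernel2d[i, j] * pp[i:i + T, j:j + V]
    mask = 1.0 / (1.0 + np.exp(-response))       # values in (0, 1)
    return x + x * mask[None, :, :]

# Usage: 3 channels, 16 frames, 25 joints (the NTU RGB+D skeleton has 25 joints).
rng = np.random.default_rng(0)
features = rng.standard_normal((3, 16, 25))
widened = multiscale_temporal_conv(features)
excited = spatiotemporal_excitation(features, np.full((3, 3), 1.0 / 9.0))
```

Larger dilation rates let a branch cover a longer span of frames with the same kernel size, while the excitation mask reweights every (frame, joint) position using a single map shared by all channels.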
