Paper Information - Multi-Level Co-Occurrence Graph Convolutional LSTM for Skeleton-Based Action Recognition

Multi-Level Co-Occurrence Graph Convolutional LSTM for Skeleton-Based Action Recognition

Human action recognition plays an important role in e-health applications such as surgical skill analysis, patient monitoring, and automatic nursing systems. Recently, skeleton-based action recognition has gained considerable attention. It is an essential yet challenging task that requires effectively modeling both the intra-frame skeleton representation and the inter-frame temporal dynamics. Traditional Long Short-Term Memory (LSTM) based methods mainly capture long-term action context at the global level, but they cannot fully model the relationships between different joints or persons, and so fail to mine crucial co-occurrence features at different levels. To overcome this drawback, we propose a general end-to-end Multi-level Co-occurrence Graph Convolutional LSTM (MCGC-LSTM). By incorporating graph convolutional networks (GCN) into the LSTM, our model can both better exploit the body structural information in skeletons and enhance multi-level co-occurrence feature learning. Specifically, we first devise multi-level co-occurrence (MC) memory units coupled with GCN to automatically model the spatial relationships between joints while simultaneously capturing co-occurrence features across different joints, persons, and frames. We then construct aggregated features of multi-level co-occurrences (AFMC) from the MC memory units to better encode the intra-frame action context, and use a concurrent LSTM (Co-LSTM) to further model their temporal dynamics for action recognition. Experiments show that the proposed model significantly outperforms mainstream methods on the NTU RGB+D 120 and Northwestern-UCLA datasets.
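The general idea of feeding graph-convolved skeleton features into an LSTM can be illustrated with a minimal NumPy sketch. This is not the authors' MCGC-LSTM: it has no MC memory units, AFMC, or Co-LSTM. It only assumes a standard symmetric-normalized graph convolution applied per frame, followed by a plain LSTM cell over the flattened joint features. All function names, shapes, and the toy skeleton are hypothetical choices made for illustration.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected skeleton graph."""
    a = np.eye(num_joints)  # self-loops, so every degree is at least 1
    for i, j in edges:
        a[i, j] = 1.0
        a[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def graph_conv(x, a_hat, w):
    """One graph convolution: ReLU(A_hat @ X @ W), with x of shape (joints, in_dim)."""
    return np.maximum(a_hat @ x @ w, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, w_x, w_h, b):
    """Standard LSTM cell step; gates are stacked as input, forget, output, candidate."""
    z = x @ w_x + h @ w_h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def encode_sequence(frames, a_hat, w_gc, w_x, w_h, b):
    """Apply the graph convolution to each frame, then run the LSTM over time."""
    hidden = w_h.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:  # x: (joints, 3) joint coordinates for one frame
        feat = graph_conv(x, a_hat, w_gc).reshape(-1)  # flatten joint features
        h, c = lstm_step(feat, h, c, w_x, w_h, b)
    return h  # final hidden state, usable as a sequence-level descriptor

# Toy example: a hypothetical 5-joint skeleton observed over 8 frames.
rng = np.random.default_rng(0)
toy_edges = [(0, 1), (1, 2), (1, 3), (1, 4)]
toy_a_hat = normalized_adjacency(toy_edges, num_joints=5)
toy_frames = rng.normal(size=(8, 5, 3))
gc_dim, hidden_dim = 4, 6
toy_w_gc = rng.normal(size=(3, gc_dim)) * 0.1
toy_w_x = rng.normal(size=(5 * gc_dim, 4 * hidden_dim)) * 0.1
toy_w_h = rng.normal(size=(hidden_dim, 4 * hidden_dim)) * 0.1
toy_b = np.zeros(4 * hidden_dim)
toy_h = encode_sequence(toy_frames, toy_a_hat, toy_w_gc, toy_w_x, toy_w_h, toy_b)
print(toy_h.shape)
```

In this sketch the graph convolution mixes each joint's features with those of its skeletal neighbors before the recurrent step, which is the simplest way to let an LSTM see body structure. A classifier over the final hidden state would complete an action recognition pipeline.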

Bin Hu | Xiping Hu | Haocong Rao | Shihao Xu
