A Cuboid CNN Model With an Attention Mechanism for Skeleton-Based Action Recognition

The introduction of depth sensors such as Microsoft Kinect have driven research in human action recognition. Human skeletal data collected from depth sensors convey a significant amount of information for action recognition. While there has been considerable progress in action recognition, most existing skeleton-based approaches neglect the fact that not all human body parts move during many actions, and they fail to consider the ordinal positions of body joints. Here, and motivated by the fact that an action's category is determined by local joint movements, we propose a cuboid model for skeleton-based action recognition. Specifically, a cuboid arranging strategy is developed to organize the pairwise displacements between all body joints to obtain a cuboid action representation. Such a representation is well structured and allows deep CNN models to focus analyses on actions. Moreover, an attention mechanism is exploited in the deep model, such that the most relevant features are extracted. Extensive experiments on our new Yunnan University-Chinese Academy of Sciences-Multimodal Human Action Dataset (CAS-YNU MHAD), the NTU RGB+D dataset, the UTD-MHAD dataset, and the UTKinect-Action3D dataset demonstrate the effectiveness of our method compared to the current state-of-the-art.

[1]  Wei Liu,et al.  Latent Max-Margin Multitask Learning With Skelets for 3-D Action Recognition , 2017, IEEE Transactions on Cybernetics.

[2]  Pichao Wang,et al.  Skeleton Optical Spectra-Based Action Recognition Using Convolutional Neural Networks , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[3]  Chao Li,et al.  Skeleton-based action recognition with convolutional neural networks , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[4]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[5]  Jun Cheng,et al.  Spatio-temporal cuboid pyramid for action recognition using depth motion sequences , 2016, 2016 Eighth International Conference on Advanced Computational Intelligence (ICACI).

[6]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[7]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[8]  David Picard,et al.  Learning features combination for human action recognition from skeleton sequences , 2017, Pattern Recognit. Lett..

[9]  Fang Liu,et al.  Global Low-Rank Image Restoration With Gaussian Mixture Model , 2018, IEEE Transactions on Cybernetics.

[10]  Qi Tian,et al.  Incremental Codebook Adaptation for Visual Representation and Categorization , 2018, IEEE Transactions on Cybernetics.

[11]  Gang Wang,et al.  Recurrent Attentional Networks for Saliency Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[13]  Hugo Larochelle,et al.  Recurrent Mixture Density Network for Spatiotemporal Visual Attention , 2016, ICLR.

[14]  Yuqing Gao,et al.  Robust Fusion of Color and Depth Data for RGB-D Target Tracking Using Adaptive Range-Invariant Depth Models and Spatio-Temporal Consistency Constraints , 2018, IEEE Transactions on Cybernetics.

[15]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[17]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Xuelong Li,et al.  Principal Component 2-D Long Short-Term Memory for Font Recognition on Single Chinese Characters , 2016, IEEE Transactions on Cybernetics.

[20]  Jin Zhang,et al.  STFC: Spatio-temporal feature chain for skeleton-based human action recognition , 2015, J. Vis. Commun. Image Represent..

[21]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[22]  Yueting Zhuang,et al.  Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks , 2018, IEEE Transactions on Multimedia.

[23]  Anastasios Tefas,et al.  Information Clustering Using Manifold-Based Optimization of the Bag-of-Features Representation , 2018, IEEE Transactions on Cybernetics.

[24]  Wu Luo,et al.  Key Joints Selection and Spatiotemporal Mining for Skeleton-Based Action Recognition , 2018, 2018 25th IEEE International Conference on Image Processing (ICIP).

[25]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[27]  Xu Zhao,et al.  Attention-Based Multiview Re-Observation Fusion Network for Skeletal Action Recognition , 2019, IEEE Transactions on Multimedia.

[28]  Hong Liu,et al.  Two-Stream 3D Convolutional Neural Network for Skeleton-Based Action Recognition , 2017, ArXiv.

[29]  Wei Liu,et al.  Discriminative Multi-instance Multitask Learning for 3D Action Recognition , 2017, IEEE Transactions on Multimedia.

[30]  Joseph J. LaViola,et al.  Exploring the Trade-off Between Accuracy and Observational Latency in Action Recognition , 2013, International Journal of Computer Vision.

[31]  Feiping Nie,et al.  Euler Label Consistent K-SVD for image classification and action recognition , 2018, Neurocomputing.

[32]  Wanqing Li,et al.  Discriminative Key Pose Extraction Using Extended LC-KSVD for Action Recognition , 2014, 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[33]  Alex Graves,et al.  Recurrent Models of Visual Attention , 2014, NIPS.

[34]  Pichao Wang,et al.  Joint Distance Maps Based Action Recognition With Convolutional Neural Networks , 2017, IEEE Signal Processing Letters.

[35]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[36]  Shuang Wang,et al.  Skeleton-based action recognition using LSTM and CNN , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[37]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[38]  Jing Zhang,et al.  ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring , 2015, ACM Multimedia.

[39]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Yong Du,et al.  Skeleton based action recognition with convolutional neural network , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[42]  Pichao Wang,et al.  Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks , 2016, ACM Multimedia.

[43]  Jianxin Wu,et al.  A Tube-and-Droplet-Based Approach for Representing and Analyzing Motion Trajectories , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Hong Liu,et al.  Fusing shape and motion matrices for view invariant action recognition using 3D skeletons , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[45]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Koichi Shinoda,et al.  Spectral Graph Skeletons for 3D Action Recognition , 2014, ACCV.