Skeleton-Based Square Grid for Human Action Recognition With 3D Convolutional Neural Network

Convolutional neural networks (CNNs) handle grid-structured data effectively, but they cannot be applied directly to dynamic skeletons, which are usually expressed as graph structures. In this study, we first propose a skeleton-based square grid (SSG) that transforms dynamic skeletons into three-dimensional (3D) grid-structured data so that CNNs can be applied to them. Each SSG contains a joint-based square grid (JSG) and a rigid-based square grid (RSG), which are built on the intrinsic and extrinsic dependencies of the body parts, respectively. Next, to enhance the ability of deep features to capture correlations within the 3D grid-structured data, we construct a two-stream 3D CNN that learns spatiotemporal features from the JSG and RSG sequences. Finally, we introduce a soft attention model that selectively focuses on the informative body parts in the skeleton sequences. We validate our model on action recognition using three datasets: NTU RGB+D, Kinetics Motion, and SBU Kinect Interaction. The experimental results demonstrate the effectiveness of the proposed approach and its superior performance compared with state-of-the-art methods.
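The grid transformation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how joints or rigid bodies are assigned to grid cells, so the 5-joint toy skeleton, `TOY_JOINT_LAYOUT`, `TOY_BONES`, `TOY_BONE_LAYOUT`, and all function names below are hypothetical. The sketch only shows the general idea of stacking per-frame square grids into a 3D volume for a joint stream and a rigid (bone-vector) stream, plus a generic softmax as one common way to obtain soft attention weights.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = Sequence[Vec3]  # one 3D coordinate per joint

# Hypothetical toy skeleton: joint 0 is the torso, joints 1-4 are limb ends.
# Each cell of the square layout holds a joint index (assumed layout, not the paper's).
TOY_JOINT_LAYOUT: List[List[int]] = [
    [1, 0, 2],
    [0, 0, 0],
    [3, 0, 4],
]
# Hypothetical rigid bodies (bones) as (parent, child) joint pairs.
TOY_BONES: List[Tuple[int, int]] = [(0, 1), (0, 2), (0, 3), (0, 4)]
# Each cell of this layout holds an index into TOY_BONES.
TOY_BONE_LAYOUT: List[List[int]] = [
    [0, 1],
    [2, 3],
]


def to_grid(items: Sequence[Vec3], layout: List[List[int]]) -> List[List[Vec3]]:
    """Place per-item 3D vectors into a square grid according to the layout."""
    return [[items[idx] for idx in row] for row in layout]


def bone_vectors(frame: Frame, bones: List[Tuple[int, int]]) -> List[Vec3]:
    """Represent each rigid body as the vector from its parent to its child joint."""
    return [
        tuple(c - p for c, p in zip(frame[child], frame[parent]))
        for parent, child in bones
    ]


def joint_volume(frames: Sequence[Frame]) -> List[List[List[Vec3]]]:
    """Stack per-frame joint grids into a T x H x W volume of 3D vectors."""
    return [to_grid(f, TOY_JOINT_LAYOUT) for f in frames]


def rigid_volume(frames: Sequence[Frame]) -> List[List[List[Vec3]]]:
    """Stack per-frame bone-vector grids into a T x H x W volume of 3D vectors."""
    return [to_grid(bone_vectors(f, TOY_BONES), TOY_BONE_LAYOUT) for f in frames]


def soft_attention_weights(scores: Sequence[float]) -> List[float]:
    """Generic softmax over per-part scores (numerically stabilised)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    frame = [(0.0, 0.0, 0.0), (-1.0, 1.0, 0.0), (1.0, 1.0, 0.0),
             (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)]
    clip = [frame, frame]  # a two-frame toy sequence
    print(len(joint_volume(clip)), len(rigid_volume(clip)))
```

In a two-stream setup, each of the two volumes would be fed to its own 3D CNN branch, with the xyz components acting as input channels; how the branches are fused and where attention is applied are details the abstract does not give.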
