Robust Human Action Recognition Using Global Spatial-Temporal Attention for Human Skeleton Data

Human action recognition from video sequences is one of the most challenging computer vision tasks, primarily owing to intrinsic variations in lighting, pose, occlusion, and other factors. Human skeleton joints extracted by depth cameras such as the Kinect offer simplified structure and rich content, and are therefore widely used for capturing human actions. However, most current skeleton-based, deep-learning action recognition methods treat all skeletal joints equally in both the spatial and temporal dimensions. This disregards the fact that, for different human actions, the contributions of individual skeletal joints can vary significantly across space and time. Incorporating such natural variations aids in designing a robust human action recognition system. Hence, in this work we propose a global spatial attention (GSA) model that assigns different weights to different skeletal joints, providing precise spatial information for human action recognition. We further introduce an accumulative learning curve (ALC) model that highlights which frames contribute most to the final decision by assigning a temporal weight to each intermediate accumulated learning result produced by an LSTM over the input frames. The proposed GSA (spatial) and ALC (temporal) models are integrated into an LSTM framework to construct a robust action recognition pipeline that takes human skeletal joints as input and predicts the action using the enhanced spatial-temporal attention model. Rigorous experiments on the NTU RGB+D dataset (by far the largest benchmark RGB-D dataset) show that the proposed framework achieves the best recognition accuracy with the lowest algorithmic complexity and training overhead among state-of-the-art human action recognition models. A minimal code sketch of the described architecture follows.
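To make the architecture concrete, below is a minimal PyTorch sketch of the pipeline as described: a spatial-attention module that re-weights joints, an LSTM over the re-weighted sequence, and a temporal scorer that weights each frame's hidden state before classification. The module names, layer sizes, and softmax scorers are our illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalSpatialAttention(nn.Module):
    """Assigns one scalar weight per skeletal joint (GSA-style spatial attention)."""
    def __init__(self, coord_dim=3, hidden=64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(coord_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        # x: (batch, frames, joints, coord_dim)
        logits = self.score(x).squeeze(-1)           # (batch, frames, joints)
        alpha = F.softmax(logits, dim=-1)            # joint weights sum to 1 per frame
        weighted = x * alpha.unsqueeze(-1)           # emphasize informative joints
        return weighted.flatten(2), alpha            # (batch, frames, joints*coord_dim)

class GSAALCNet(nn.Module):
    """LSTM over spatially re-weighted joints; an ALC-style temporal scorer
    weights each intermediate hidden state before the final decision."""
    def __init__(self, num_joints, num_classes, coord_dim=3, hidden=128):
        super().__init__()
        self.gsa = GlobalSpatialAttention(coord_dim)
        self.lstm = nn.LSTM(num_joints * coord_dim, hidden, batch_first=True)
        self.alc_score = nn.Linear(hidden, 1)        # temporal attention scorer
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):
        feats, _ = self.gsa(x)                       # spatially re-weighted input
        h, _ = self.lstm(feats)                      # h: (batch, frames, hidden)
        beta = F.softmax(self.alc_score(h).squeeze(-1), dim=-1)   # frame weights
        pooled = (h * beta.unsqueeze(-1)).sum(dim=1) # attention-pooled summary
        return self.classifier(pooled)

# Example shapes: 25 Kinect joints and 60 classes as in NTU RGB+D, 30-frame clips.
model = GSAALCNet(num_joints=25, num_classes=60)
clip = torch.randn(4, 30, 25, 3)                     # (batch, frames, joints, xyz)
logits = model(clip)                                 # (4, 60)
```

In this sketch the spatial weights alpha and temporal weights beta are both learned end-to-end with the classifier, which matches the paper's claim that joints and frames should contribute unequally to the final decision; the exact scoring networks used by the authors may differ.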
