Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates

Skeleton-based human action recognition has attracted a lot of research attention during the past few years. Recent works attempted to utilize recurrent neural networks to model the temporal dependencies between the 3D positional configurations of human body joints for better analysis of human activities in the skeletal data. The proposed work extends this idea to spatial domain as well as temporal domain to better analyze the hidden sources of action-related information within the human skeleton sequences in both of these domains simultaneously. Based on the pictorial structure of Kinect's skeletal data, an effective tree-structure based traversal framework is also proposed. In order to deal with the noise in the skeletal data, a new gating mechanism within LSTM module is introduced, with which the network can learn the reliability of the sequential data and accordingly adjust the effect of the input data on the updating procedure of the long-term context representation stored in the unit's memory cell. Moreover, we introduce a novel multi-modal feature fusion strategy within the LSTM unit in this paper. The comprehensive experimental results on seven challenging benchmark datasets for human action recognition demonstrate the effectiveness of the proposed method.

[1]  Ming Shao,et al.  A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[3]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[4]  Arif Mahmood,et al.  Real time action recognition using histograms of depth gradients and random decision forests , 2014, IEEE Winter Conference on Applications of Computer Vision.

[5]  Hong Cheng,et al.  Interactive body part contrast mining for human interaction recognition , 2014, 2014 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[6]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[7]  Nasser Kehtarnavaz,et al.  Fusion of depth, skeleton, and inertial data for human action recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Wenjun Zeng,et al.  Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks , 2016, ECCV.

[9]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Zhi Liu,et al.  3D-based Deep Convolutional Neural Network for action recognition with depth sequences , 2016, Image Vis. Comput..

[11]  Mohan M. Trivedi,et al.  Joint Angles Similarities and HOG2 for Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[12]  Cordelia Schmid,et al.  P-CNN: Pose-Based CNN Features for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[13]  Hanqing Lu,et al.  Fusing multi-modal features for gesture recognition , 2013, ICMI '13.

[14]  Jian-Huang Lai,et al.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Youssef Chahir,et al.  Spatiotemporal representation of 3D skeleton joints-based action recognition using modified spherical harmonics , 2016, Pattern Recognit. Lett..

[16]  Hermann Ney,et al.  LSTM Neural Networks for Language Modeling , 2012, INTERSPEECH.

[17]  Xiaodong Yang,et al.  Effective 3D action recognition using EigenJoints , 2014, J. Vis. Commun. Image Represent..

[18]  Guodong Guo,et al.  Fusing Spatiotemporal Features and Joints for 3D Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[19]  Ennio Gambi,et al.  Evaluation of a skeleton-based method for human activity recognition on a large-scale RGB-D dataset , 2016 .

[20]  Yi Yang,et al.  Articulated pose estimation with flexible mixtures-of-parts , 2011, CVPR 2011.

[21]  Tao Mei,et al.  Action Recognition by Learning Deep Multi-Granular Spatio-Temporal Video Representation , 2016, ICMR.

[22]  Stan Sclaroff,et al.  Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Christopher Joseph Pal,et al.  Semi-supervised Learning with Encoder-Decoder Recurrent Neural Networks: Experiments with Motion Capture Sequences , 2015, ArXiv.

[24]  Behrooz Mahasseni,et al.  Regularizing Long Short Term Memory with 3D Human-Skeleton Sequences for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Ajmal Mian,et al.  3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Bingbing Ni,et al.  Recurrent Modeling of Interaction Context for Collective Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Li Fei-Fei,et al.  Unsupervised Learning of Long-Term Motion Dynamics for Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Dumitru Erhan,et al.  Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Guillermo Garcia-Hernando,et al.  Transition Forests: Learning Discriminative Temporal Transitions for Action Recognition and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Anuj Srivastava,et al.  Accurate 3D action recognition using learning on the Grassmann manifold , 2015, Pattern Recognit..

[31]  Tae-Kyun Kim,et al.  Learning and Refining of Privileged Information-Based RNNs for Action Recognition from Depth Sequences , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[33]  Silvio Savarese,et al.  Social LSTM: Human Trajectory Prediction in Crowded Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Wei Wang,et al.  Hierarchical motion evolution for action recognition , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[35]  Cordelia Schmid,et al.  Towards Understanding Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[38]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Ruzena Bajcsy,et al.  Berkeley MHAD: A comprehensive Multimodal Human Action Database , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[40]  Hassen Drira,et al.  Human-object interaction recognition by learning the distances between the object and the skeleton joints , 2015, 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG).

[41]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Fei Han,et al.  Space-Time Representation of People Based on 3D Skeletal Data: A Review , 2016, Comput. Vis. Image Underst..

[43]  Rushil Anirudh,et al.  Elastic functional coding of human actions: From vector-fields to latent variables , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[45]  Ling Shao,et al.  From handcrafted to learned representations for human action recognition: A survey , 2016, Image Vis. Comput..

[46]  Balaraman Ravindran,et al.  Activity Recognition for Natural Human Robot Interaction , 2014, ICSR.

[47]  Venkatesh Babu Radhakrishnan,et al.  Action recognition from motion capture data using Meta-Cognitive RBF Network classifier , 2014, 2014 IEEE Ninth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP).

[48]  Sergio Escalera,et al.  Multi-modal gesture recognition challenge 2013: dataset and results , 2013, ICMI '13.

[49]  Alberto Del Bimbo,et al.  Submitted to Ieee Transactions on Cybernetics 1 3d Human Action Recognition by Shape Analysis of Motion Trajectories on Riemannian Manifold , 2022 .

[50]  Mooi Choo Chuah,et al.  Category-Blind Human Action Recognition: A Practical Recognition System , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[51]  Junsong Yuan,et al.  Learning Actionlet Ensemble for 3D Human Action Recognition , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Ying Wu,et al.  Learning Maximum Margin Temporal Warping for Action Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[53]  Clément Farabet,et al.  Torch7: A Matlab-like Environment for Machine Learning , 2011, NIPS 2011.

[54]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[55]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[56]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[57]  Xi Wang,et al.  Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification , 2015, ACM Multimedia.

[58]  Luc Van Gool,et al.  Gesture Recognition Portfolios for Personalization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  René Vidal,et al.  Moving Poselets: A Discriminative and Interpretable Skeletal Motion Representation for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW).

[60]  Guo-Jun Qi,et al.  Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[61]  Yoshua Bengio,et al.  Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding , 2013, INTERSPEECH.

[62]  R. Venkatesh Babu,et al.  Real-time human action recognition from motion capture data , 2013, 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG).

[63]  Ajmal S. Mian,et al.  Learning a non-linear knowledge transfer model for cross-view action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Gang Wang,et al.  Multi-modal feature fusion for action recognition in RGB-D sequences , 2014, 2014 6th International Symposium on Communications, Control and Signal Processing (ISCCSP).

[65]  Cordelia Schmid,et al.  Human Detection Using Oriented Histograms of Flow and Appearance , 2006, ECCV.

[66]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[68]  Tian-Tsong Ng,et al.  Multimodal Multipart Learning for Action Recognition in Depth Videos , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[69]  Alan L. Yuille,et al.  Mining 3D Key-Pose-Motifs for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70]  Marco La Cascia,et al.  3D skeleton-based human action classification: A survey , 2016, Pattern Recognit..

[71]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[72]  Nikos Nikolaidis,et al.  Action recognition on motion capture data using a dynemes and forward differences representation , 2014, J. Vis. Commun. Image Represent..

[73]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[74]  Jun Kong,et al.  Informative joints based human action recognition using skeleton contexts , 2015, Signal Process. Image Commun..

[75]  Greg Mori,et al.  A Hierarchical Deep Temporal Model for Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  Ruzena Bajcsy,et al.  Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[77]  Beiji Zou,et al.  Automatic reconstruction of 3D human motion pose from uncalibrated monocular video sequences based on markerless human motion tracking , 2009, Pattern Recognition.

[78]  Andrew Zisserman,et al.  Domain-Adaptive Discriminative One-Shot Learning of Gestures , 2014, ECCV.

[79]  Juan Carlos Niebles,et al.  Discriminative Hierarchical Modeling of Spatio-temporally Composable Human Activities , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[80]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[81]  Marwan Torki,et al.  Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition , 2013, IJCAI.

[82]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[83]  Alberto Del Bimbo,et al.  Space-Time Pose Representation for 3D Human Action Recognition , 2013, ICIAP Workshops.

[84]  Juan Carlos Niebles,et al.  A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[85]  Luc Van Gool,et al.  Deep Learning on Lie Groups for Skeleton-Based Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[86]  Juergen Gall,et al.  Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[87]  Hairong Qi,et al.  Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps , 2013, 2013 IEEE International Conference on Computer Vision.

[88]  Bingbing Ni,et al.  Progressively Parsing Interactional Objects for Fine Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[89]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[90]  Al Alwani Adnan Salih,et al.  Spatiotemporal representation of 3D skeleton joints-based action recognition using modified spherical harmonics , 2016 .

[91]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[92]  Song-Chun Zhu,et al.  CERN: Confidence-Energy Recurrent Network for Group Activity Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[93]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[94]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[95]  Junsong Yuan,et al.  Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[96]  Georgios Evangelidis,et al.  Skeletal Quads: Human Action Recognition Using Joint Quadruples , 2014, 2014 22nd International Conference on Pattern Recognition.

[97]  Rama Chellappa,et al.  Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[98]  Greg Mori,et al.  Structure Inference Machines: Recurrent Neural Networks for Analyzing Relations in Group Activity Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[99]  Gang Wang,et al.  Real-Time RGB-D Activity Prediction by Soft Regression , 2016, ECCV.

[100]  Lei Wu,et al.  Effective Active Skeleton Representation for Low Latency Human Action Recognition , 2016, IEEE Transactions on Multimedia.

[101]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[102]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[103]  Fabio Cuzzolin,et al.  3D Activity Recognition Using Motion History and Binary Shape Templates , 2014, ACCV Workshops.

[104]  A. Savitzky,et al.  Smoothing and Differentiation of Data by Simplified Least Squares Procedures. , 1964 .