Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition

3D action recognition, the analysis of human actions based on 3D skeleton data, has recently become popular owing to its succinctness, robustness, and view-invariant representation. Recent attempts at this problem have proposed RNN-based learning methods to model the contextual dependency in the temporal domain. In this paper, we extend this idea to spatio-temporal domains to analyze the hidden sources of action-related information within the input data over both domains concurrently. Inspired by the graphical structure of the human skeleton, we further propose a more powerful tree-structure-based traversal method. To handle the noise and occlusion in 3D skeleton data, we introduce a new gating mechanism within the LSTM to learn the reliability of the sequential input data and accordingly adjust its effect on updating the long-term context information stored in the memory cell. Our method achieves state-of-the-art performance on four challenging benchmark datasets for 3D human action analysis.
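
To make the gating idea concrete, the following is a minimal sketch of how a reliability ("trust") value might modulate an LSTM memory update. It is an illustration under stated assumptions, not the formulation used in the paper: the Gaussian-shaped trust function, the parameter `lam`, the blending rule, and the names `trust_gate` and `gated_cell_update` are all hypothetical. The sketch assumes the network can form a prediction of the current input from its context, and treats a large gap between that prediction and the observed input as a sign of noise or occlusion.

```python
import math


def trust_gate(predicted, observed, lam=0.5):
    """Per-dimension trust in (0, 1].

    Assumed form: exp(-lam * error^2). Trust is 1 when the observed input
    matches the prediction from context and decays as they diverge.
    """
    return [math.exp(-lam * (p - x) ** 2) for p, x in zip(predicted, observed)]


def gated_cell_update(c_prev, candidate, input_gate, forget_gate, trust):
    """Blend new input and stored memory according to trust (assumed rule).

    High trust lets the candidate input overwrite the memory cell;
    low trust keeps the previous long-term context instead.
    """
    return [
        t * i * u + (1.0 - t) * f * c
        for t, i, u, f, c in zip(trust, input_gate, candidate, forget_gate, c_prev)
    ]


# Toy two-dimensional example with input and forget gates fully open.
c_prev = [0.5, -0.2]
candidate = [0.9, 0.4]
i_gate = [1.0, 1.0]
f_gate = [1.0, 1.0]
predicted = [0.1, 0.3]

# Observation agrees with the prediction: the memory takes the new input.
reliable = trust_gate(predicted, [0.1, 0.3])
c_reliable = gated_cell_update(c_prev, candidate, i_gate, f_gate, reliable)

# Second dimension is far from the prediction: its memory is mostly preserved.
noisy = trust_gate(predicted, [0.1, 5.0])
c_noisy = gated_cell_update(c_prev, candidate, i_gate, f_gate, noisy)

print(c_reliable, c_noisy)
```

In this toy setup the unreliable dimension keeps a value close to its previous memory content, while the reliable dimension is updated from the new input, which is the qualitative behavior the abstract describes.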
