Skeleton Feature Fusion Based on Multi-Stream LSTM for Action Recognition

Human action recognition from skeleton sequences has attracted a lot of attention in the computer vision community. Long short term memory (LSTM) network has shown its promising performance for this problem, due to their strengths in modeling the dependencies and temporal dynamics of sequential data. However, original LSTM is difficult to grasp the dynamics of entire sequence data, if the input feature of each time step is just a simple combination of raw skeleton data. In this paper, we present a fusion model to make full use of the skeleton data through multi-stream LSTM for action recognition. In each stream of the model, skeleton feature fed to each step of LSTM are extracted from different time duration, which are called single frame feature, short term feature, and long term feature, respectively. Single frame feature represents static pose, which is converted from joints coordinates directly. Short term feature represents skeleton kinematics, which is extracted from a short time window. Long term feature represents joints mutuality during the action process, which is extracted from a longer time window. All these features are modeled by LSTM, and the final states of LSTM streams are fused to predict the underlying actions. The proposed model makes better use of the skeleton dynamics than standard LSTM model. Experimental results on two benchmark skeleton data sets NTU RGB+D data set and SBU interaction dataset show that our proposed approach achieved significant performance.

[1]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yu Qiao,et al.  RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Abhinav Gupta,et al.  What Actions are Needed for Understanding Human Actions in Videos? , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[8]  Gang Wang,et al.  Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks , 2017, IEEE Transactions on Image Processing.

[9]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[10]  Haroon Idrees,et al.  The THUMOS challenge on action recognition for videos "in the wild" , 2016, Comput. Vis. Image Underst..

[11]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[12]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[13]  Ruslan Salakhutdinov,et al.  Action Recognition using Visual Attention , 2015, NIPS 2015.

[14]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[15]  Andrew W. Senior,et al.  Long short-term memory recurrent neural network architectures for large scale acoustic modeling , 2014, INTERSPEECH.

[16]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[17]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[18]  Stephen J. Maybank,et al.  Activity recognition using a supervised non-parametric hierarchical HMM , 2016, Neurocomputing.

[19]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[20]  Hairong Qi,et al.  Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps , 2013, 2013 IEEE International Conference on Computer Vision.

[21]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Bing Li,et al.  Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition , 2016, ECCV.

[24]  Sanghoon Lee,et al.  Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[25]  Mohammed Bennamoun,et al.  Learning Action Recognition Model from Depth and Skeleton Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[27]  Nasser Kehtarnavaz,et al.  Multi-Temporal Depth Motion Maps-Based Local Binary Patterns for 3-D Human Action Recognition , 2017, IEEE Access.

[28]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Marco La Cascia,et al.  3D skeleton-based human action classification: A survey , 2016, Pattern Recognit..

[31]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[32]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Qiang Chen,et al.  Network In Network , 2013, ICLR.

[34]  Hong Liu,et al.  3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector , 2016, IJCAI.

[35]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[36]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[37]  Cordelia Schmid,et al.  Dense Trajectories and Motion Boundary Descriptors for Action Recognition , 2013, International Journal of Computer Vision.

[38]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[39]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[40]  Sung Wook Baik,et al.  Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features , 2018, IEEE Access.

[41]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[42]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[43]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[44]  Yueting Zhuang,et al.  Fusing Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks , 2018, IEEE Transactions on Multimedia.

[45]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Samuel Berlemont,et al.  BLSTM-RNN Based 3D Gesture Classification , 2013, ICANN.

[47]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[48]  Chen Chen,et al.  Memory Attention Networks for Skeleton-Based Action Recognition , 2018, IEEE Transactions on Neural Networks and Learning Systems.