Representation Learning of Temporal Dynamics for Skeleton-Based Action Recognition

Motion characteristics of human actions can be represented by the position variation of skeleton joints. Traditional approaches generally extract the spatial-temporal representation of the skeleton sequences with well-designed hand-crafted features. In this paper, in order to recognize actions according to the relative motion between the limbs and the trunk, we propose an end-to-end hierarchical RNN for skeleton-based action recognition. We divide human skeleton into five main parts in terms of the human physical structure, and then feed them to five independent subnets for local feature extraction. After the following hierarchical feature fusion and extraction from local to global, dimensions of the final temporal dynamics representations are reduced to the same number of action categories in the corresponding data set through a single-layer perceptron. In addition, the output of the perceptron is temporally accumulated as the input of a softmax layer for classification. Random scale and rotation transformations are employed to improve the robustness during training. We compare with five other deep RNN variants derived from our model in order to verify the effectiveness of the proposed network. In addition, we compare with several other methods on motion capture and Kinect data sets. Furthermore, we evaluate the robustness of our model trained with random scale and rotation transformations for a multiview problem. Experimental results demonstrate that our model achieves the state-of-the-art performance with high computational efficiency.

[1]  Luc Van Gool,et al.  Gesture Recognition Portfolios for Personalization , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Ruzena Bajcsy,et al.  Bio-inspired Dynamic 3D Discriminative Skeletal Features for Human Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[3]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[4]  Sergio Escalera,et al.  Multi-modal gesture recognition challenge 2013: dataset and results , 2013, ICMI '13.

[5]  Wojciech Zaremba,et al.  Recurrent Neural Network Regularization , 2014, ArXiv.

[6]  Jonathan Tompson,et al.  Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation , 2014, NIPS.

[7]  James A. Reggia,et al.  Robust human action recognition via long short-term memory , 2013, The 2013 International Joint Conference on Neural Networks (IJCNN).

[8]  Qing Zhang,et al.  A Survey on Human Motion Analysis from Depth Data , 2013, Time-of-Flight and Depth Imaging.

[9]  R. Venkatesh Babu,et al.  Real-time human action recognition from motion capture data , 2013, 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG).

[10]  Yun Fu,et al.  Prediction of Human Activity by Discovering Temporal Sequence Patterns , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[11]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Ramakant Nevatia,et al.  Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost , 2006, ECCV.

[13]  Alan L. Yuille,et al.  An Approach to Pose-Based Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Alexandros André Chaaraoui,et al.  A discussion on the validation tests employed to compare human action recognition methods using the MSR Action3D dataset , 2014, ArXiv.

[15]  Ling Shao,et al.  Multimodal Dynamic Networks for Gesture Recognition , 2014, ACM Multimedia.

[16]  Alex Graves,et al.  Practical Variational Inference for Neural Networks , 2011, NIPS.

[17]  Ling Shao,et al.  Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Xi Chen,et al.  Classifying and visualizing motion capture sequences using deep neural networks , 2013, 2014 International Conference on Computer Vision Theory and Applications (VISAPP).

[19]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[20]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[21]  Navdeep Jaitly,et al.  Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[22]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[23]  Song-Chun Zhu,et al.  Joint action recognition and pose estimation from video , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Nikos Nikolaidis,et al.  Action recognition on motion capture data using a dynemes and forward differences representation , 2014, J. Vis. Commun. Image Represent..

[25]  Ying Wu,et al.  Cross-View Action Modeling, Learning, and Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[27]  Venkatesh Babu Radhakrishnan,et al.  Action recognition from motion capture data using Meta-Cognitive RBF Network classifier , 2014, 2014 IEEE Ninth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP).

[28]  A. Savitzky,et al.  Smoothing and Differentiation of Data by Simplified Least Squares Procedures. , 1964 .

[29]  Yoshua Bengio,et al.  Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies , 2001 .

[30]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Hanqing Lu,et al.  Fusing multi-modal features for gesture recognition , 2013, ICMI '13.

[32]  Xuelong Li,et al.  Geometric Mean for Subspace Selection , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Andrew Zisserman,et al.  Domain-Adaptive Discriminative One-Shot Learning of Gestures , 2014, ECCV.

[34]  Ling Shao,et al.  Learning Discriminative Representations from RGB-D Video Data , 2013, IJCAI.

[35]  Tido Röder,et al.  Documentation Mocap Database HDM05 , 2007 .

[36]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[37]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[38]  Dacheng Tao,et al.  Multi-View Intact Space Learning , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Tinne Tuytelaars,et al.  Modeling video evolution for action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Ruzena Bajcsy,et al.  Berkeley MHAD: A comprehensive Multimodal Human Action Database , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[41]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[42]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[43]  Marcus Liwicki,et al.  Scene labeling with LSTM recurrent neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Xuelong Li,et al.  General Tensor Discriminant Analysis and Gabor Features for Gait Recognition , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45]  Ling Shao,et al.  Structure-Preserving Binary Representations for RGB-D Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Nasser Kehtarnavaz,et al.  Real-time human action recognition based on depth motion maps , 2016, Journal of Real-Time Image Processing.

[47]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[48]  Samuel Berlemont,et al.  BLSTM-RNN Based 3D Gesture Classification , 2013, ICANN.

[49]  Ruzena Bajcsy,et al.  Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[50]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[51]  Marwan Torki,et al.  Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition , 2013, IJCAI.