Learning robust representations using recurrent neural networks for skeleton based action classification and detection

Recently, skeleton based action recognition gains more popularity due to affordable depth sensors and real-time skeleton estimation algorithms. Previous Recurrent Neural Networks (RNN) based approaches focus on modeling spatial configuration of skeletons and temporal evolution of body joints. There are certain intrinsic characteristics of the skeleton based actions. For example, the starting point may be varied, an action can be observed at arbitrary viewpoints and the skeletons are noisy. To this end, we present a novel end-to-end architecture based on RNN to learn robust representations from raw skeletons. The architecture includes three new layers, i.e., starting point transformation layer, viewpoint transformation layer and spatial dropout layer, which address the corresponding three problems, respectively. We apply the proposed method to two different tasks: action classification and detection. Experiments on two large-scale datasets (NTU RGB+D and PKU-MMD) show the superiority of our model. Specially, for action detection, our results are more than 33.4% higher the previous results.

[1]  Amr Sharaf,et al.  Real-Time Multi-scale Action Detection from 3D Skeleton Data , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[2]  Gang Yu,et al.  Discriminative Orderlet Mining for Real-Time Recognition of Human-Object Interaction , 2014, ACCV.

[3]  Yong Du,et al.  Representation Learning of Temporal Dynamics for Skeleton-Based Action Recognition , 2016, IEEE Transactions on Image Processing.

[4]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[6]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[7]  Hongsong Wang,et al.  Modeling Temporal Dynamics and Spatial Configurations of Actions Using Two-Stream Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Marco La Cascia,et al.  3D skeleton-based human action classification: A survey , 2016, Pattern Recognit..

[9]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[10]  Jake K. Aggarwal,et al.  Human activity recognition from 3D data: A review , 2014, Pattern Recognit. Lett..

[11]  Wenjun Zeng,et al.  Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks , 2016, ECCV.

[12]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[13]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[14]  Samsu Sempena,et al.  Human action recognition using Dynamic Time Warping , 2011, Proceedings of the 2011 International Conference on Electrical Engineering and Informatics.

[15]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[16]  Guo-Jun Qi,et al.  Differential Recurrent Neural Networks for Action Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Bart Selman,et al.  Human Activity Detection from RGBD Images , 2011, Plan, Activity, and Intent Recognition.

[18]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[19]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Xin Zhao,et al.  Structured Streaming Skeleton -- A New Feature for Online Human Gesture Recognition , 2014, TOMM.

[21]  Cristian Sminchisescu,et al.  The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Jonathan Tompson,et al.  Efficient object localization using Convolutional Networks , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Sebastian Nowozin,et al.  Action Points: A Representation for Low-latency Online Human Action Recognition , 2012 .