Leveraging Pre-trained CNN Models for Skeleton-Based Action Recognition

Skeleton-based human action recognition has recently drawn increasing attention thanks to the availability of low-cost motion capture devices and the accessibility of large-scale 3D skeleton datasets. One of the key challenges in action recognition lies in the high dimensionality of the captured data. In recent works, researchers have drawn inspiration from the success of deep learning in computer vision to improve the performance of action recognition systems. Unfortunately, most of these studies develop new architectures rather than leveraging the deep architectures that are already available, even though many of the latter achieve very high accuracy on a range of image classification problems. In this paper, we use such architectures, pre-trained on other image classification tasks. Skeleton sequences are first transformed into an image-like data representation. The resulting images are then used to train several state-of-the-art CNN architectures following different training procedures. The experimental results obtained on the popular NTU RGB+D dataset are very promising and outperform most of the state-of-the-art results.
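
The abstract does not specify how skeleton sequences are turned into images, so the following is only a minimal sketch under stated assumptions: joints are laid out as image rows, frames as columns, and the x, y, z coordinates are min-max scaled per channel into the three colour channels. The function name `skeleton_to_image` and the random input are hypothetical and serve only to illustrate one plausible encoding.

```python
import numpy as np


def skeleton_to_image(seq: np.ndarray) -> np.ndarray:
    """Encode a (frames, joints, 3) skeleton sequence as a (joints, frames, 3) uint8 image.

    Assumed encoding (not taken from the paper): x, y, z become the three
    colour channels, each min-max scaled to [0, 255] over the whole sequence.
    """
    if seq.ndim != 3 or seq.shape[2] != 3:
        raise ValueError("expected an array of shape (frames, joints, 3)")
    # Per-coordinate minimum and maximum over all frames and joints.
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    # Avoid division by zero when a coordinate is constant.
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (seq - lo) / span
    img = np.round(scaled * 255.0).astype(np.uint8)
    # Put joints on the vertical axis and time on the horizontal axis.
    return img.transpose(1, 0, 2)


# Synthetic stand-in for one sequence: 40 frames, 25 joints, 3D coordinates.
rng = np.random.default_rng(0)
sequence = rng.normal(size=(40, 25, 3))
image = skeleton_to_image(sequence)
```

An image produced this way would typically still need resizing to the input resolution a given pre-trained CNN expects before it can be used for fine-tuning.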
