Action Recognition with Augmented MoCap Data Using Neural Data Translation

This study aims to generate reliable augmented training data for learning a robust deep model for action recognition. The prior knowledge that can be inferred from only a few training examples is insufficient to represent the real data distribution well, which makes action recognition quite challenging. Inspired by recent advances in neural machine translation, we propose neural data translation (NDT) to tackle this issue by directly learning the mapping between paired data of the same action class in an end-to-end fashion. The proposed NDT is a sequence-to-sequence generative model: it can be trained with only a few paired training examples and generates an abundant set of augmented actions with diverse appearances. Specifically, we adopt stochastic pair selection to compile a set of paired training data, where each pair consists of two actions of the same class; one action serves as the input to NDT, while the other acts as the desired output. By learning the mapping between data of the same class, NDT implicitly encodes the intra-class variations and can therefore synthesize high-quality actions for augmentation. We evaluate our method on two public datasets, Florence3D-Action and UCI HAR. The promising results demonstrate that the actions generated by our method effectively improve action recognition performance when only a few examples are available.
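
To make the pairing and sequence-to-sequence idea concrete, below is a minimal PyTorch sketch of the approach described above. It is not the authors' implementation: the module name (NDTSeq2Seq), the LSTM encoder-decoder layout, the hidden size, the 64-dimensional pose features, and the MSE reconstruction loss are all illustrative assumptions for a sequence-to-sequence model trained on same-class pairs.

```python
# Sketch only: names and hyper-parameters below are hypothetical, not the
# paper's actual architecture or training setup.
import random
import torch
import torch.nn as nn


def stochastic_pair_selection(sequences, labels):
    """Randomly pair each action with another action of the same class.

    sequences: list of tensors, each of shape (T, D) -- one pose sequence.
    labels:    list of class ids, aligned with `sequences`.
    Returns a list of (source, target) tensor pairs of the same class.
    """
    by_class = {}
    for seq, lab in zip(sequences, labels):
        by_class.setdefault(lab, []).append(seq)
    pairs = []
    for seq, lab in zip(sequences, labels):
        candidates = [s for s in by_class[lab] if s is not seq]
        if candidates:
            pairs.append((seq, random.choice(candidates)))
    return pairs


class NDTSeq2Seq(nn.Module):
    """LSTM encoder-decoder that maps one action sequence to another of the
    same class, so the model implicitly captures intra-class variation."""

    def __init__(self, feat_dim, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, src, tgt):
        # Encode the source action into its final hidden state.
        _, state = self.encoder(src)
        # Teacher forcing: feed the target shifted right by one frame.
        dec_in = torch.cat([torch.zeros_like(tgt[:, :1]), tgt[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, state)
        return self.out(dec_out)


# Toy usage with random data: batch of 8 paired actions, 30 frames each,
# 64-dim pose features (hypothetical), trained with a reconstruction loss.
model = NDTSeq2Seq(feat_dim=64)
src = torch.randn(8, 30, 64)
tgt = torch.randn(8, 30, 64)
loss = nn.functional.mse_loss(model(src, tgt), tgt)
loss.backward()
```

At test time, a sketch like this would decode from an encoded real action to synthesize a new same-class sequence, and the generated sequences would then be added to the pool used to train the action classifier.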
