Adversarial Action Data Augmentation for Similar Gesture Action Recognition

Human gestures are unique for recognizing and describing human actions, and video-based human action recognition techniques are effective solutions to varies real-world applications, such as surveillance, video indexing, and human-computer interaction. Most existing video human action recognition approaches either using handcraft features from the frames or deep learning models such as convolutional neural networks (CNN) and recurrent neural networks (RNN); however, they have mostly overlooked the similar gestures between different actions when processing the frames into the models. The classifiers suffer from similar features extracted from similar gestures, which are unable to classify the actions in the video streams. In this paper, we propose a novel framework with generative adversarial networks (GAN) to generate the data augmentation for similar gesture action recognition. The contribution of our work is tri-fold: 1) we proposed a novel action data augmentation framework (ADAF) to enlarge the differences between the actions with very similar gestures; 2) the framework can boost the classification performance either on similar gesture action pairs or the whole dataset; 3) experiments conducted on both KTH and UCF101 datasets show that our data augmentation framework boost the performance on both similar gestures actions as well as the whole dataset compared with baseline methods such as 2DCNN and 3DCNN.

[1]  Chengqi Zhang,et al.  Tri-Party Deep Network Representation , 2016, IJCAI.

[2]  Ghassan Al-Regib,et al.  TS-LSTM and Temporal-Inception: Exploiting Spatiotemporal Dynamics for Activity Recognition , 2017, Signal Process. Image Commun..

[3]  Yi Yang,et al.  Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.

[5]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[6]  Cordelia Schmid,et al.  Action recognition by dense trajectories , 2011, CVPR 2011.

[7]  John W. Fisher,et al.  Dreaming More Data: Class-dependent Distributions over Diffeomorphisms for Learned Data Augmentation , 2015, AISTATS.

[8]  Mubarak Shah,et al.  Learning human actions via information maximization , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Peng Wang,et al.  Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[10]  Feng Wang,et al.  A compact pairwise trajectory representation for action recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[11]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[12]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, ICPR 2004.

[13]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[14]  C. Krishna Mohan,et al.  Action-vectors: Unsupervised movement modeling for action recognition , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Huafeng Chen,et al.  Multiple instance discriminative dictionary learning for action recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[17]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[18]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xinghao Jiang,et al.  Two-Stream Dictionary Learning Architecture for Action Recognition , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[20]  Christian Wolf,et al.  Sequential Deep Learning for Human Action Recognition , 2011, HBU.

[21]  Vittorio Murino,et al.  Efficient pooling of image based CNN features for action recognition in videos , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[22]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Lei Zhang,et al.  Realistic human action recognition: When CNNS meet LDS , 2017, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Ali Farhadi,et al.  Actions ~ Transformations , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[26]  Gang Sun,et al.  A Key Volume Mining Deep Framework for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Amos J. Storkey,et al.  Data Augmentation Generative Adversarial Networks , 2017, ICLR 2018.

[29]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[30]  Lina Yao,et al.  Adversarially Regularized Graph Autoencoder , 2018, IJCAI.

[31]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Jeremy S. Smith,et al.  Hierarchical Multi-scale Attention Networks for action recognition , 2017, Signal Process. Image Commun..

[33]  Limin Wang,et al.  Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice , 2014, Comput. Vis. Image Underst..

[34]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[35]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.