Recurrent Semantic Preserving Generation for Action Prediction

In this paper, we propose a recurrent semantic preserving generation (RSPG) method for action prediction. Unlike most existing methods which don’t make full use of information from partially observed sequences, we develop a generation architecture to complement the sequence of skeletons for predicting the action, which can exploit more potential information of the movement tendency. Our method learns to capture the tendency of observed sequences and complement the subsequent action with adversarial learning under some constrains, which preserves the consistency between the generation sequence and the observed sequence. By generating the subsequent action, our method can predict the action with the most probability. Moreover, the redundant generation introduces the noise and disturbs the prediction. The insufficient generation cannot exploit the potential information for improving the effect of predicting the action. Our RSPG controls the generation step in a recurrent manner for maximizing the discriminative information of actions, which can adapt to the variable length of different actions. We evaluate our method on four popular action datasets: NTU, UCF101, BIT, and UT-Interaction, and experimental results show that our method achieves very competitive performance with the state-of-the-art.

[1]  Xiaoyan Sun,et al.  Temporal–Spatial Mapping for Action Recognition , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[2]  Vineeth N. Balasubramanian,et al.  Attentive Semantic Video Generation Using Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Larry H. Matthies,et al.  First-Person Activity Recognition: What Are They Doing to Me? , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[4]  Eric P. Xing,et al.  Dual Motion GAN for Future-Flow Embedded Video Prediction , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[5]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[6]  Eric Eaton,et al.  Online Multi-Task Learning for Policy Gradient Methods , 2014, ICML.

[7]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Xiaoou Tang,et al.  Video Frame Synthesis Using Deep Voxel Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[9]  Luc Van Gool,et al.  Deep Learning on Lie Groups for Skeleton-Based Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Xinghao Jiang,et al.  Two-Stream Dictionary Learning Architecture for Action Recognition , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[11]  Pichao Wang,et al.  Skeleton Optical Spectra-Based Action Recognition Using Convolutional Neural Networks , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[12]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[13]  Guy Lever,et al.  Deterministic Policy Gradient Algorithms , 2014, ICML.

[14]  Luc Van Gool,et al.  Dynamic Filter Networks , 2016, NIPS.

[15]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[16]  Jun Miao,et al.  Activity Auto-Completion: Predicting Human Activities from Partial Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[17]  Gang Wang,et al.  Hierarchical Spatial Sum–Product Networks for Action Recognition in Still Images , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[18]  Wei-Shi Zheng,et al.  Global-Local Temporal Saliency Action Prediction , 2017, IEEE Transactions on Image Processing.

[19]  Bin Sun,et al.  Action Prediction From Videos via Memorizing Hard-to-Predict Samples , 2018, AAAI.

[20]  Yun Fu,et al.  A Discriminative Model with Multiple Temporal Scales for Action Prediction , 2014, ECCV.

[21]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Yi Yang,et al.  2019 Formatting Instructions for Authors Using LaTeX , 2018 .

[23]  Michael Lam,et al.  Unsupervised Video Summarization with Adversarial LSTM Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Cristian Sminchisescu,et al.  Reinforcement Learning for Visual Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Shuguang Cui,et al.  Deep Reinforcement Learning of Volume-Guided Progressive View Inpainting for 3D Point Scene Completion From a Single Depth Image , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  Michael S. Ryoo,et al.  Human activity prediction: Early recognition of ongoing activities from streaming videos , 2011, 2011 International Conference on Computer Vision.

[27]  Christian Ledig,et al.  Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jiwen Lu,et al.  Part-Activated Deep Reinforcement Learning for Action Prediction , 2018, ECCV.

[29]  Mei Wang,et al.  Fair Loss: Margin-Aware Reinforcement Learning for Deep Face Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[30]  Sven J. Dickinson,et al.  Recognize Human Activities from Partially Observed Videos , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[31]  Bingbing Ni,et al.  Multiple Granularity Group Interaction Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Vishal M. Patel,et al.  Image De-Raining Using a Conditional Generative Adversarial Network , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[33]  Dumitru Erhan,et al.  Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Richard P. Wildes,et al.  Spatiotemporal Feature Residual Propagation for Action Prediction , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[35]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Yun Fu,et al.  Adversarial Action Prediction Networks , 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Alex Graves,et al.  Playing Atari with Deep Reinforcement Learning , 2013, ArXiv.

[38]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Antonio Torralba,et al.  Generating Videos with Scene Dynamics , 2016, NIPS.

[40]  Yun Fu,et al.  Max-Margin Action Prediction Machine , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  José M. F. Moura,et al.  Few-Shot Human Motion Prediction via Meta-learning , 2018, ECCV.

[42]  Sridha Sridharan,et al.  Predicting the Future: A Jointly Learnt Model for Action Anticipation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[43]  Silvio Savarese,et al.  A Hierarchical Representation for Future Action Prediction , 2014, ECCV.

[44]  Yunde Jia,et al.  Learning Human Interaction by Interactive Phrases , 2012, ECCV.

[45]  Suman Saha,et al.  Online Real-Time Multiple Spatiotemporal Action Localisation and Prediction , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[46]  Song-Chun Zhu,et al.  Predicting Human Activities Using Stochastic Grammar , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[47]  Yi Yang,et al.  Uncovering the Temporal Context for Video Question Answering , 2017, International Journal of Computer Vision.

[48]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[49]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[50]  Liang Wang,et al.  Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yi Yang,et al.  Bidirectional Multirate Reconstruction for Temporal Modeling in Videos , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Gang Wang,et al.  Real-Time RGB-D Activity Prediction by Soft Regression , 2016, ECCV.

[54]  Sergey Levine,et al.  Continuous Deep Q-Learning with Model-based Acceleration , 2016, ICML.

[55]  Shunta Saito,et al.  Temporal Generative Adversarial Nets with Singular Value Clipping , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[56]  Weiwei Liu,et al.  Generating Realistic Videos From Keyframes With Concatenated GANs , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[57]  Xiao Liu,et al.  Action Parsing-Driven Video Summarization Based on Reinforcement Learning , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[58]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[59]  Richard Hartley,et al.  Action Anticipation with RBF Kernelized Feature Mapping RNN , 2018, ECCV.

[60]  Svetlana Lazebnik,et al.  Active Object Localization with Deep Reinforcement Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[61]  Shuchang Zhou,et al.  Learning to Paint With Model-Based Deep Reinforcement Learning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[62]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[64]  Wenhao Wu,et al.  Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).