Adversarial Generative Grammars for Human Activity Prediction

In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further in the future, than prior work.

[1]  Ole Winther,et al.  Sequential Neural Models with Stochastic Layers , 2016, NIPS.

[2]  Michael J. Black,et al.  On Human Motion Prediction Using Recurrent Neural Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Stephen J. McKenna,et al.  Combining embedded accelerometers with computer vision for recognizing food preparation activities , 2013, UbiComp.

[4]  Li Fei-Fei,et al.  Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos , 2015, International Journal of Computer Vision.

[5]  Feng Han,et al.  Bottom-Up/Top-Down Image Parsing with Attribute Grammar , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Yee Whye Teh,et al.  The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables , 2016, ICLR.

[7]  Rob Fergus,et al.  Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks , 2015, NIPS.

[8]  Sergey Levine,et al.  Stochastic Adversarial Video Prediction , 2018, ArXiv.

[9]  Lantao Yu,et al.  SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient , 2016, AAAI.

[10]  Bernt Schiele,et al.  Time-Conditioned Action Anticipation in One Shot , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Silvio Savarese,et al.  Structural-RNN: Deep Learning on Spatio-Temporal Graphs , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Song-Chun Zhu,et al.  Image Parsing with Stochastic Scene Grammar , 2011, NIPS.

[13]  Sergey Levine,et al.  Unsupervised Learning for Physical Interaction through Video Prediction , 2016, NIPS.

[14]  Sergey Levine,et al.  Stochastic Variational Video Prediction , 2017, ICLR.

[15]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[16]  Fan Yang,et al.  Differentiable Learning of Logical Rules for Knowledge Base Reasoning , 2017, NIPS.

[17]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Irfan A. Essa,et al.  Recognizing multitasked activities from video using stochastic context-free grammar , 2002, AAAI/IAAI.

[19]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[21]  Yazan Abu Farha,et al.  When will you do what? - Anticipating Temporal Occurrences of Activities , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[23]  Andrew Y. Ng,et al.  Parsing Natural Scenes and Natural Language with Recursive Neural Networks , 2011, ICML.

[24]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[25]  Jitendra Malik,et al.  Recurrent Network Models for Human Dynamics , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[26]  Jan Kautz,et al.  High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[27]  Wang Ling,et al.  Memory Architectures in Recurrent Neural Network Language Models , 2018, ICLR.

[28]  Deva Ramanan,et al.  Parsing Videos of Actions with Segmental Grammars , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[30]  Song-Chun Zhu,et al.  Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data for Future Prediction , 2018, ICML.

[31]  Cristian Sminchisescu,et al.  Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Wei Liu,et al.  Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamic , 2018, IJCAI.

[33]  Jake K. Aggarwal,et al.  Recognition of Composite Human Activities through Context-Free Grammar Based Representation , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[34]  Aaron F. Bobick,et al.  From Stochastic Grammar to Bayes Network: Probabilistic Parsing of Complex Activity , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Andrew M. Dai,et al.  MaskGAN: Better Text Generation via Filling in the ______ , 2018, ICLR.

[36]  Eric P. Xing,et al.  Toward Controlled Generation of Text , 2017, ICML.

[37]  Rob Fergus,et al.  Stochastic Video Generation with a Learned Prior , 2018, ICML.

[38]  Alaa A. Kharbouch,et al.  Three models for the description of language , 1956, IRE Trans. Inf. Theory.

[39]  Song-Chun Zhu,et al.  Predicting Human Activities Using Stochastic Grammar , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).