A Two-Stream Variational Adversarial Network for Video Generation

Video generation is inherently challenging because a model must synthesize realistic content and realistic motion at the same time. Existing methods entangle the two, generating motion and content with a single generator network, an approach that often breaks down on complex videos. In this paper, we propose a two-stream video generation model, the Two-Stream Variational Adversarial Network (TwoStreamVAN), which separates content and motion generation into two parallel generators. Given an input action label, our model outputs a realistic video by progressively generating and fusing motion and content features at multiple scales using adaptive motion kernels. In addition, to better evaluate video generation models, we design a new synthetic human action dataset that bridges the difficulty gap between overly complex real-world human action datasets and simple toy datasets. Our model significantly outperforms existing methods on the standard Weizmann Human Action and MUG Facial Expression datasets, as well as on our new dataset.
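To make the fusion step concrete, the sketch below illustrates one plausible way a motion stream could predict spatially adaptive kernels that modulate content features at a single scale. This is a minimal illustration in PyTorch, not the authors' released implementation; the module name `AdaptiveMotionFusion` and details such as the kernel size and residual connection are assumptions for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveMotionFusion(nn.Module):
    """Illustrative fusion of motion into content features via
    motion-conditioned, spatially varying kernels (hypothetical sketch)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Map motion features to one k*k kernel per channel and spatial location.
        self.kernel_head = nn.Conv2d(
            channels, channels * kernel_size * kernel_size,
            kernel_size=3, padding=1)

    def forward(self, content, motion):
        # content, motion: (B, C, H, W) feature maps at the same scale
        b, c, h, w = content.shape
        k = self.kernel_size
        # Predict a spatially varying kernel from the motion stream.
        kernels = self.kernel_head(motion).view(b, c, k * k, h, w)
        kernels = F.softmax(kernels, dim=2)  # normalize each local kernel
        # Gather k*k neighborhoods of the content features and apply the kernels.
        patches = F.unfold(content, kernel_size=k, padding=k // 2)
        patches = patches.view(b, c, k * k, h, w)
        fused = (kernels * patches).sum(dim=2)  # (B, C, H, W)
        return content + fused  # residual fusion of motion into content
```

Applied at each decoder scale, a module of this kind would let the motion stream steer how content features are locally deformed before the next upsampling step, which is the intuition behind fusing the two streams progressively at multiple scales.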
