Adversarial Video Generation on Complex Datasets

Generative models of natural images have progressed towards high-fidelity samples largely by leveraging scale. We attempt to carry this success over to video modeling by showing that large Generative Adversarial Networks trained on the complex Kinetics-600 dataset can produce video samples of substantially higher complexity and fidelity than previous work. Our proposed model, Dual Video Discriminator GAN (DVD-GAN), scales to longer and higher-resolution videos by leveraging a computationally efficient decomposition of its discriminator. We evaluate on the related tasks of video synthesis and video prediction, achieving a new state-of-the-art Fréchet Inception Distance for prediction on Kinetics-600 and a state-of-the-art Inception Score for synthesis on the UCF-101 dataset, and establishing a strong baseline for synthesis on Kinetics-600.
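
For readers unfamiliar with the Inception Score mentioned above, the following is a minimal sketch of its standard definition, the exponential of the mean KL divergence between each sample's predicted class distribution p(y|x) and the marginal p(y). The function name `inception_score` and the plain-list input format are assumptions made for illustration; the classifier that produces the probabilities is outside the scope of this sketch, and nothing here is taken from the authors' evaluation code.

```python
import math

def inception_score(probs):
    """Hypothetical helper: probs is a list of per-sample class-probability
    lists p(y|x), each summing to 1 (assumed to come from some classifier)."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    # Mean KL(p(y|x) || p(y)); terms with zero probability contribute 0.
    mean_kl = 0.0
    for p in probs:
        mean_kl += sum(pj * math.log(pj / mj)
                       for pj, mj in zip(p, marginal) if pj > 0)
    mean_kl /= n
    return math.exp(mean_kl)
```

Under this definition the score is 1 when every sample receives the same prediction, and it approaches the number of classes when predictions are confident and spread evenly across classes.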
