Video Diffusion Models

Generating temporally coherent, high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables joint training on image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate longer and higher-resolution videos, we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
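The abstract describes the model as a natural extension of a standard image diffusion architecture that can be trained jointly on still images and video clips. As a purely illustrative sketch of how such an extension might look in practice (the block below, including the FactorizedSpaceTimeBlock name and all shapes, is an assumption for exposition, not the paper's actual architecture), one common pattern is to keep the per-frame spatial layers of an image model and interleave a temporal attention layer that mixes information across frames, so that a single-frame image batch simply skips the cross-frame mixing:

```python
# Hypothetical sketch: extending a per-frame (image) attention block with a
# temporal attention layer so the same network can process videos and images.
# Names, shapes, and layer choices are illustrative assumptions only.
import torch
import torch.nn as nn


class FactorizedSpaceTimeBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention
    across frames at each spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width); frames == 1 for still images.
        b, t, c, h, w = x.shape

        # Spatial attention: each frame is handled independently, tokens are pixels.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        s_n = self.norm1(s)
        s = s + self.spatial_attn(s_n, s_n, s_n)[0]

        # Temporal attention: tokens are the frames at a fixed spatial location.
        v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        if t > 1:  # single images (t == 1) skip cross-frame mixing entirely
            v_n = self.norm2(v)
            v = v + self.temporal_attn(v_n, v_n, v_n)[0]

        # Restore the (batch, frames, channels, height, width) layout.
        return v.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)


if __name__ == "__main__":
    block = FactorizedSpaceTimeBlock(channels=64)
    video_batch = torch.randn(2, 8, 64, 16, 16)   # 8-frame clips
    image_batch = torch.randn(2, 1, 64, 16, 16)   # single frames from an image dataset
    print(block(video_batch).shape, block(image_batch).shape)
```

Because frames interact only inside the temporal layer, minibatches can freely mix video clips with single images; this is one plausible way to realize the joint image and video training described above, though the paper's own mechanism may differ.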
