Video Diffusion Models

Generating temporally coherent, high-fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables joint training on image and video data, which we find reduces the variance of minibatch gradients and speeds up optimization. To generate longer and higher-resolution videos, we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
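The abstract describes the model as a natural extension of a standard image diffusion architecture that can be trained jointly on still images and video clips. As a purely illustrative sketch of how such an extension might look in practice (the block below, including the FactorizedSpaceTimeBlock name and all shapes, is an assumption for exposition, not the paper's actual architecture), one common pattern is to keep the per-frame spatial layers of an image model and interleave a temporal attention layer that mixes information across frames, so that a single-frame image batch simply skips the cross-frame mixing:

```python
# Hypothetical sketch: extending a per-frame (image) attention block with a
# temporal attention layer so the same network can process videos and images.
# Names, shapes, and layer choices are illustrative assumptions only.
import torch
import torch.nn as nn


class FactorizedSpaceTimeBlock(nn.Module):
    """Spatial self-attention within each frame, then temporal self-attention
    across frames at each spatial location."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width); frames == 1 for still images.
        b, t, c, h, w = x.shape

        # Spatial attention: each frame is handled independently, tokens are pixels.
        s = x.permute(0, 1, 3, 4, 2).reshape(b * t, h * w, c)
        s_n = self.norm1(s)
        s = s + self.spatial_attn(s_n, s_n, s_n)[0]

        # Temporal attention: tokens are the frames at a fixed spatial location.
        v = s.reshape(b, t, h * w, c).permute(0, 2, 1, 3).reshape(b * h * w, t, c)
        if t > 1:  # single images (t == 1) skip cross-frame mixing entirely
            v_n = self.norm2(v)
            v = v + self.temporal_attn(v_n, v_n, v_n)[0]

        # Restore the (batch, frames, channels, height, width) layout.
        return v.reshape(b, h * w, t, c).permute(0, 2, 3, 1).reshape(b, t, c, h, w)


if __name__ == "__main__":
    block = FactorizedSpaceTimeBlock(channels=64)
    video_batch = torch.randn(2, 8, 64, 16, 16)   # 8-frame clips
    image_batch = torch.randn(2, 1, 64, 16, 16)   # single frames from an image dataset
    print(block(video_batch).shape, block(image_batch).shape)
```

Because frames interact only inside the temporal layer, minibatches can freely mix video clips with single images; this is one plausible way to realize the joint image and video training described above, though the paper's own mechanism may differ.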
