
Imagen Video: High Definition Video Generation with Diffusion Models

We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See for

David J. Fleet | Diederik P. Kingma | Ben Poole | Tim Salimans | Mohammad Norouzi | Jonathan Ho | William Chan | Ruiqi Gao | Jay Whang | A. Gritsenko | Chitwan Saharia
