Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super-resolution models. We focus on two relevant real-world applications: simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512 × 1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 × 2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://research.nvidia.com/labs/toronto-ai/VideoLDM/
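The idea of keeping a pre-trained image model fixed and training only a temporal alignment component can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper's implementation: the names `spatial_layer`, `temporal_layer`, and `video_block` are hypothetical, the "layers" are simple arithmetic stand-ins for neural network layers, and the blending factor `alpha` is one plausible way to combine per-frame and cross-frame outputs.

```python
# Toy sketch: interleave a frozen per-frame ("spatial") layer with a new
# temporal layer. All names here are hypothetical, chosen for illustration.

def spatial_layer(frame):
    """Stand-in for a frozen, pre-trained image layer: sees one frame only."""
    return [2.0 * x + 1.0 for x in frame]

def temporal_layer(video):
    """Stand-in for a trainable temporal layer: mixes each latent position
    across neighbouring frames (here, a simple moving average over time)."""
    num_frames = len(video)
    mixed = []
    for t in range(num_frames):
        lo, hi = max(0, t - 1), min(num_frames, t + 2)
        window = video[lo:hi]
        mixed.append([sum(vals) / len(window) for vals in zip(*window)])
    return mixed

def video_block(video, alpha):
    """Apply the spatial layer frame by frame, then blend in the temporal
    layer's output. With alpha = 1.0 the temporal path has no effect."""
    per_frame = [spatial_layer(frame) for frame in video]
    across_time = temporal_layer(per_frame)
    return [
        [alpha * s + (1.0 - alpha) * m for s, m in zip(s_frame, m_frame)]
        for s_frame, m_frame in zip(per_frame, across_time)
    ]

# A "video" of 3 frames, each a flattened latent with 2 entries.
video = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
image_only = video_block(video, alpha=1.0)  # behaves like the image model
aligned = video_block(video, alpha=0.5)     # frames now influence each other
```

In this sketch, setting `alpha` to 1.0 recovers the per-frame image model exactly, which mirrors why such a design could reuse an off-the-shelf image LDM: only the temporal path and its blending would need training on video data.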
