Latent Video Diffusion Models for High-Fidelity Long Video Generation

AI-generated content has attracted considerable attention recently, but photorealistic video synthesis remains challenging. Although many attempts have been made with GANs and autoregressive models, the visual quality and length of generated videos are far from satisfactory. Diffusion models have recently shown remarkable results but demand substantial computational resources. To address this, we introduce lightweight video diffusion models that operate in a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space so that longer videos of more than one thousand frames can be produced. To further overcome the performance degradation of long video generation, we propose conditional latent perturbation and unconditional guidance, which effectively mitigate the errors that accumulate as the video is extended. Extensive experiments on small-domain datasets of different categories show that our framework generates more realistic and longer videos than previous strong baselines. We additionally extend our approach to large-scale text-to-video generation to demonstrate its superiority. Our code and models will be made publicly available.
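
To make the two long-video fixes concrete, here is a minimal sketch (not the authors' code) of how autoregressive clip extension in a latent space could combine conditional latent perturbation (noising the conditioning latents so the model never sees unrealistically clean inputs at inference) with unconditional, classifier-free-style guidance. The `denoiser` stub, tensor shapes, the toy update rule, and all hyperparameters below are illustrative assumptions, not the paper's actual architecture or schedule.

```python
import torch

B, C, T, H, W = 1, 4, 16, 32, 32   # batch, latent channels, frames, spatial dims (assumed)
STEPS = 50                          # number of denoising steps (assumed)
GUIDANCE = 2.0                      # unconditional-guidance scale (assumed)
PERTURB = 0.1                       # std of conditional latent perturbation (assumed)

def denoiser(z_t, t, cond):
    """Stand-in for a 3D-UNet noise predictor; cond=None means unconditional."""
    return torch.zeros_like(z_t)    # placeholder: predicts zero noise

@torch.no_grad()
def extend_clip(prev_latents):
    # Conditional latent perturbation: add small Gaussian noise to the clean
    # conditioning latents so train/inference conditions match and errors
    # do not accumulate across repeated autoregressive extensions.
    cond = prev_latents + PERTURB * torch.randn_like(prev_latents)

    z = torch.randn(B, C, T, H, W)  # start the new clip from pure noise
    for i in reversed(range(STEPS)):
        t = torch.full((B,), i)
        eps_c = denoiser(z, t, cond)   # conditional noise estimate
        eps_u = denoiser(z, t, None)   # unconditional noise estimate
        # Unconditional guidance: push the conditional prediction away
        # from the unconditional one by the guidance scale.
        eps = eps_u + GUIDANCE * (eps_c - eps_u)
        z = z - eps / STEPS            # toy update standing in for a real sampler step
    return z

first_clip = torch.randn(B, C, T, H, W)  # latents of an already-generated clip
next_clip = extend_clip(first_clip)
print(next_clip.shape)                   # torch.Size([1, 4, 16, 32, 32])
```

In an actual latent video diffusion pipeline the toy update would be a DDPM/DDIM sampler step and `denoiser` a trained spatiotemporal network; the sketch only shows where the perturbation and the guidance combination enter the extension loop.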
