NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to a gap between training on short videos and inferring long videos, and the sequential generation is inefficient. Instead, our approach adopts a ``coarse-to-fine'' process, in which all segments at the same granularity can be generated in parallel. A global diffusion model first generates the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to train directly on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build FlintstonesHD, a new benchmark dataset for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55 min to 26 s (a 94.26\% reduction) on the same hardware when generating 1024 frames. The homepage link is \url{https://msra-nuwa.azurewebsites.net/}.
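To make the hierarchy concrete, the sketch below shows one way the ``coarse-to-fine'' recursion could be organized: a global model drafts sparse keyframes over the whole time range, then local models repeatedly fill in between neighbouring frames. This is a minimal illustration with assumed interfaces (the `StubDiffusion` class, the `sample` signature, and the segment length are hypothetical placeholders), not the authors' actual implementation.

```python
# Hypothetical sketch of the "diffusion over diffusion" recursion: coarse
# keyframes first, then recursive local infilling between neighbouring frames.

class StubDiffusion:
    """Placeholder for a (global or local) diffusion model; interface is assumed."""
    def sample(self, prompt, num_frames, first_frame=None, last_frame=None):
        # A real model would run iterative denoising conditioned on the prompt
        # and, for local models, on the two bounding frames.
        return [f"frame({prompt},{i})" for i in range(num_frames)]

def generate_long_video(prompt, total_frames, segment_len=16,
                        global_model=StubDiffusion(), local_model=StubDiffusion()):
    # 1) Coarse pass: keyframes spanning the entire time range.
    frames = global_model.sample(prompt, num_frames=segment_len)

    # 2) Fine passes: fill between every pair of neighbouring frames.
    #    All segments at one level are independent, so they can run in parallel.
    while len(frames) < total_frames:
        refined = []
        for first, last in zip(frames[:-1], frames[1:]):
            segment = local_model.sample(prompt, num_frames=segment_len,
                                         first_frame=first, last_frame=last)
            refined.extend(segment[:-1])   # drop the duplicated right boundary
        refined.append(frames[-1])
        frames = refined
    return frames[:total_frames]           # trim only for this sketch

print(len(generate_long_video("a cartoon scene", total_frames=1024)))  # 1024
```

With 16 frames per segment, two rounds of local infilling already expand 16 keyframes to 3376 frames (16 to 226 to 3376), consistent with the training length quoted in the abstract; each round only conditions on pairs of nearby frames, which is what allows every segment at a level to be sampled in parallel.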
