Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) that leverages the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and background time-consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low-overhead, yet high-quality and remarkably consistent, video generation. Moreover, our approach is not limited to text-to-video synthesis but also applies to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably to, and sometimes better than, recent approaches despite not being trained on additional video data. Our code will be open-sourced at https://github.com/Picsart-AI-Research/Text2Video-Zero.
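The two modifications described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the function names enrich_latents_with_motion and cross_frame_attention are hypothetical, the motion-dynamics enrichment is approximated here by a simple constant global translation of a single initial latent, and the cross-frame attention simply swaps each frame's self-attention keys and values for those of the first frame, assuming a Stable-Diffusion-style UNet whose attention layers expose q, k, v tensors.

import torch

def enrich_latents_with_motion(x1, num_frames, delta=0.05):
    # x1: initial latent code of shape [C, H, W].
    # Build per-frame latents by applying a growing global translation to x1,
    # a rough stand-in for the motion-dynamics enrichment described above
    # (torch.roll wraps around at the borders, which a real warp would not).
    _, h, w = x1.shape
    frames = [x1]
    for k in range(1, num_frames):
        dy = int(round(k * delta * h))
        dx = int(round(k * delta * w))
        frames.append(torch.roll(x1, shifts=(dy, dx), dims=(1, 2)))
    return torch.stack(frames)  # [T, C, H, W]

def cross_frame_attention(q, k, v):
    # q, k, v: [T, heads, tokens, dim]. Every frame's queries attend to the
    # keys/values of the FIRST frame, which helps keep the foreground object's
    # appearance and identity consistent across frames.
    k0 = k[:1].expand_as(k)
    v0 = v[:1].expand_as(v)
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k0.transpose(-2, -1)) * scale, dim=-1)
    return attn @ v0

In practice, the enriched latents would seed the diffusion sampler for each frame, and cross_frame_attention would replace the per-frame self-attention inside the denoising UNet; both steps require no training or fine-tuning, in line with the zero-shot setting above.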
