ControlVideo: Training-free Controllable Text-to-Video Generation

Text-driven diffusion models have unlocked unprecedented abilities in image generation, whereas their video counterparts still lag behind due to the excessive training cost of temporal modeling. Beyond the training burden, generated videos also suffer from appearance inconsistency and structural flicker, especially in long video synthesis. To address these challenges, we design a \emph{training-free} framework called \textbf{ControlVideo} to enable natural and efficient text-to-video generation. ControlVideo, adapted from ControlNet, leverages the coarse structural consistency of input motion sequences and introduces three modules to improve video generation. First, to ensure appearance coherence between frames, ControlVideo adds full cross-frame interaction to the self-attention modules. Second, to mitigate flicker, it introduces an interleaved-frame smoother that applies frame interpolation to alternate frames. Finally, to produce long videos efficiently, it utilizes a hierarchical sampler that synthesizes each short clip separately while preserving holistic coherence. Equipped with these modules, ControlVideo outperforms state-of-the-art methods on extensive motion-prompt pairs, both quantitatively and qualitatively. Notably, thanks to its efficient design, it generates both short and long videos within several minutes on a single NVIDIA 2080Ti. Code is available at https://github.com/YBYBZhang/ControlVideo.
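To make the first module concrete, below is a minimal PyTorch sketch of fully cross-frame self-attention, assuming per-frame query/key/value projections of shape (frames, tokens, dim); the names and shapes are illustrative, not the repository's actual implementation. Folding the frame axis into the token axis turns independent per-frame self-attention into joint attention over the whole video, which is what ties the appearance of all frames together.

```python
import torch

def full_cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Self-attention in which every frame attends to every frame.

    q, k, v: (frames, tokens, dim) projections from one self-attention
    block of the 2D UNet (illustrative shapes). Flattening frames into
    the token axis makes each spatial token attend across the video.
    """
    f, t, d = q.shape
    q = q.reshape(f * t, d)
    k = k.reshape(f * t, d)
    v = v.reshape(f * t, d)
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)  # (f*t, f*t) joint attention map
    return (attn @ v).reshape(f, t, d)
```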
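The interleaved-frame smoother can likewise be sketched in a few lines. The sketch below assumes an off-the-shelf frame-interpolation model such as RIFE, exposed here as a hypothetical `interpolate(prev, next)` callable that returns the temporal midpoint of two frames; the parity alternates across denoising steps, so over two consecutive steps every interior frame gets smoothed.

```python
def interleaved_frame_smoother(frames, interpolate, step):
    """Reduce flicker by replacing alternate frames with interpolations
    of their two neighbors, alternating parity across denoising steps.

    frames:      list of per-frame tensors (e.g., a decoded x0 prediction)
    interpolate: hypothetical frame-interpolation callable, e.g. a RIFE
                 wrapper mapping (prev, next) to their temporal midpoint
    step:        current denoising step index, used to alternate parity
    """
    parity = step % 2  # smooth even-indexed frames at one step, odd at the next
    out = list(frames)
    for i in range(1, len(frames) - 1):  # boundary frames stay fixed
        if i % 2 == parity:
            out[i] = interpolate(frames[i - 1], frames[i + 1])
    return out
```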
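Finally, a hedged sketch of the hierarchical sampler: key frames spaced one clip apart are generated first for global coherence, and each short clip in between is then generated conditioned on its two boundary key frames. The `generate(indices, anchors)` sampler below is a hypothetical interface standing in for the conditioned denoising loop; generating clip by clip is what keeps memory and runtime low enough for long videos.

```python
def hierarchical_sample(generate, num_frames, clip_len):
    """Two-pass long-video sampling.

    generate(indices, anchors): hypothetical sampler that denoises the
    frames at `indices`, optionally attending to already-generated
    `anchors` (here, the two key frames bounding a clip).
    """
    key_idx = list(range(0, num_frames, clip_len))
    if key_idx[-1] != num_frames - 1:
        key_idx.append(num_frames - 1)  # ensure the last frame is a key frame

    video = {}
    # Pass 1: key frames jointly, with fully cross-frame attention.
    for i, frame in zip(key_idx, generate(key_idx, anchors=None)):
        video[i] = frame
    # Pass 2: fill each clip, conditioned on its boundary key frames.
    for a, b in zip(key_idx[:-1], key_idx[1:]):
        inner = list(range(a + 1, b))
        for i, frame in zip(inner, generate(inner, anchors=(video[a], video[b]))):
            video[i] = frame
    return [video[i] for i in sorted(video)]
```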
