ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing

In this paper, we present ControlVideo, a novel method for text-driven video editing. Leveraging the capabilities of text-to-image diffusion models and ControlNet, ControlVideo aims to enhance the fidelity and temporal consistency of videos that align with a given text while preserving the structure of the source video. This is achieved by incorporating additional conditions such as edge maps, and by fine-tuning the key-frame and temporal attention on the source video-text pair with carefully designed strategies. We conduct an in-depth exploration of ControlVideo's design choices to inform future research on one-shot tuning of video diffusion models. Quantitatively, ControlVideo outperforms a range of competitive baselines in faithfulness and consistency while still aligning with the textual prompt. Moreover, it delivers videos with high visual realism and fidelity with respect to the source content, demonstrates flexibility in utilizing controls that carry varying degrees of source video information, and supports combinations of multiple controls. The project page is available at \href{https://ml.cs.tsinghua.edu.cn/controlvideo/}{https://ml.cs.tsinghua.edu.cn/controlvideo/}.
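The one-shot tuning recipe described above, training only the key-frame and temporal attention on the single source video-text pair while the rest of the pretrained backbone stays frozen, can be sketched as follows. This is a minimal illustration: the module names are hypothetical placeholders, not ControlVideo's actual parameter names.

```python
# Hypothetical sketch of the one-shot tuning strategy: select only the
# key-frame and temporal attention parameters for fine-tuning, and leave
# every other parameter of the pretrained model frozen.
# All names below are illustrative, not the real ControlVideo identifiers.

KEYS_TO_TUNE = ("attn_keyframe", "attn_temporal")

def trainable_param_names(all_names):
    """Return the subset of parameter names selected for fine-tuning."""
    return [n for n in all_names if any(k in n for k in KEYS_TO_TUNE)]

names = [
    "unet.block0.attn_spatial.to_q",     # frozen: original spatial attention
    "unet.block0.attn_keyframe.to_q",    # tuned: attends to the key frame
    "unet.block0.attn_temporal.to_out",  # tuned: attends across frames
    "controlnet.edge_branch.conv_in",    # frozen: edge-map condition branch
]
print(trainable_param_names(names))
# → ['unet.block0.attn_keyframe.to_q', 'unet.block0.attn_temporal.to_out']
```

In a real training loop, the frozen parameters would have gradient updates disabled, so the single video-text pair only adapts the attention layers that enforce structure and temporal consistency.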
