Edit-A-Video: Single Video Editing with Object-Aware Consistency

Although text-to-video (TTV) models have recently achieved remarkable success, few approaches extend TTV to video editing. Motivated by TTV models that adapt diffusion-based text-to-image (TTI) models, we propose a video editing framework, termed Edit-A-Video, that requires only a pretrained TTI model and a single text-video pair. The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection. The two stages enable temporal modeling and preservation of the semantic attributes of the source video. One of the key challenges in video editing is background inconsistency, where regions not targeted by the edit suffer undesirable and temporally inconsistent alterations. To mitigate this issue, we also introduce a novel mask blending method, termed sparse-causal blending (SC Blending). We improve previous mask blending methods to reflect temporal consistency, so that the edited area exhibits a smooth transition while the unedited regions remain spatio-temporally consistent. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method over baselines in terms of background consistency, text alignment, and video editing quality.
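
The abstract does not spell out how sparse-causal blending is computed, so the following is only a minimal sketch of one plausible reading, by analogy with sparse-causal attention: each frame's edit mask is merged with the masks of the first frame and the previous frame, and the merged mask decides where edited latents replace source latents. The function names `sparse_causal_masks` and `blend_latents`, the array shapes, and the use of a plain union are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sparse_causal_masks(masks):
    """Merge each frame's mask with the first and previous frame's masks.

    masks: boolean array of shape (T, H, W), one edit mask per frame.
    Assumption: the merge is a plain union (logical OR).
    """
    masks = np.asarray(masks, dtype=bool)
    # Previous-frame masks; frame 0 has no predecessor, so it reuses itself.
    prev = np.concatenate([masks[:1], masks[:-1]], axis=0)
    # First-frame mask repeated for every frame.
    first = np.broadcast_to(masks[:1], masks.shape)
    return masks | prev | first


def blend_latents(edited, source, masks):
    """Keep edited latents inside the merged mask, source latents outside.

    edited, source: float arrays of shape (T, C, H, W).
    masks: boolean array of shape (T, H, W).
    """
    m = sparse_causal_masks(masks)[:, None].astype(edited.dtype)
    return m * edited + (1.0 - m) * source


# Toy example: 3 frames, 1x4 pixels, the edited region moves one pixel per frame.
toy_masks = np.array([
    [[1, 0, 0, 0]],
    [[0, 1, 0, 0]],
    [[0, 0, 1, 0]],
], dtype=bool)
merged = sparse_causal_masks(toy_masks)
edited = np.ones((3, 2, 1, 4))
source = np.zeros((3, 2, 1, 4))
blended = blend_latents(edited, source, toy_masks)
```

In this toy case the last pixel is never covered by any mask, so it keeps the source value in every frame, which is the background-preservation behavior the abstract describes for unedited regions.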
