Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework that adapts image models to videos. The framework consists of two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures, and colors. The second part propagates the key frames to the remaining frames with temporal-aware patch matching and frame blending. Our framework achieves both global style and local texture temporal consistency at low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally coherent videos.
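To make the two-stage pipeline concrete, here is a minimal conceptual sketch in Python. The names (`rerender_video`, `translate_key_frame`, `propagate`) and the key-frame interval are hypothetical placeholders chosen for illustration, not the authors' actual API; the two core operations are assumed to be supplied as callables.

```python
from typing import Any, Callable, List, Sequence


def rerender_video(
    frames: Sequence[Any],
    translate_key_frame: Callable[..., Any],  # adapted diffusion model (hypothetical)
    propagate: Callable[..., Any],            # patch matching + blending (hypothetical)
    interval: int = 10,                       # key-frame sampling interval (assumed)
) -> List[Any]:
    """Conceptual sketch of the two-stage zero-shot translation pipeline."""
    # Stage 1: translate sparse key frames with the diffusion model.
    # Conditioning each key frame on previously translated ones stands in
    # for the hierarchical cross-frame constraints on shape/texture/color.
    key_idx = list(range(0, len(frames), interval))
    keys = {}
    for i in key_idx:
        anchors = [keys[j] for j in key_idx if j < i]
        keys[i] = translate_key_frame(frames[i], anchors=anchors)

    # Stage 2: propagate the translated key frames to every frame via
    # temporal-aware patch matching, blending the two nearest key frames.
    out: List[Any] = []
    for t, frame in enumerate(frames):
        left = max(i for i in key_idx if i <= t)
        right = min((i for i in key_idx if i >= t), default=left)
        out.append(propagate(frame, keys[left], keys[right], t, left, right))
    return out
```

The sketch only fixes the control flow; the heavy lifting (diffusion sampling under cross-frame constraints in Stage 1, and flow-guided patch matching plus frame blending in Stage 2) lives inside the two supplied callables.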
