Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advances in text-to-video synthesis have shown that this can be achieved with prompts alone. While text conveniently conveys the overall scene context, it is often insufficient for precise control. In this paper, we explore customized video generation that uses text as the context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, performs joint-conditional video generation with a Latent Diffusion Model that is first pre-trained for still-image synthesis and then extended to video generation by introducing temporal modules. This two-stage learning scheme not only reduces the required computing resources but also improves performance by transferring the rich concepts learned from image datasets to video generation. Moreover, we use a simple yet effective causal attention mask strategy that enables longer video synthesis while effectively mitigating quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate its potential for practical usage.
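
The causal attention mask strategy lends itself to a short illustration. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes the temporal self-attention module operates over per-frame tokens laid out as (batch, heads, frames, dim), and the function name and tensor layout are hypothetical. The key idea is that each frame attends only to itself and to earlier frames, so a video can be extended beyond the training length without later frames disturbing the already-generated context.

```python
# Illustrative sketch (assumed layout, not the paper's released code):
# a causal mask applied inside a temporal self-attention module, so that
# frame t attends only to frames <= t.
import torch
import torch.nn.functional as F

def causal_temporal_attention(q, k, v):
    """q, k, v: (batch, heads, frames, dim) tensors of per-frame tokens."""
    t = q.size(2)
    scale = q.size(-1) ** -0.5
    # Pairwise frame-to-frame attention scores.
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    # Mask out the upper triangle: positions where the key frame lies
    # in the future relative to the query frame.
    mask = torch.triu(
        torch.ones(t, t, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores = scores.masked_fill(mask, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)
```

Under this sketch, the masking is what allows chunk-by-chunk synthesis of longer clips: each new chunk conditions on the frames already produced, while the earlier frames never see (and are never degraded by) the ones generated after them.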
