Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to exploit the abundance of existing video clips and synthesize a coherent storytelling video by customizing their appearances. We achieve this with a framework comprising two functional modules: (i) Motion Structure Retrieval, which provides video candidates whose scene or motion context matches the query text, and (ii) Structure-Guided Text-to-Video Synthesis, which generates plot-aligned videos under the guidance of the motion structure and text prompts. For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as the motion structure. For the second module, we propose a controllable video generation model that offers flexible control over structure and characters; videos are synthesized by following the structural guidance and appearance instructions. To ensure visual consistency across clips, we further propose an effective concept personalization approach that allows the desired character identities to be specified through text prompts. Extensive experiments demonstrate that our approach has significant advantages over various existing baselines.
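To make the two-module pipeline concrete, the sketch below traces one pass of the described workflow: retrieve candidate clips for a text query, extract per-frame depth as the motion structure, and synthesize the final clip conditioned on that structure and a prompt carrying a personalized character token. This is a minimal illustration, not the authors' released API; every name here is hypothetical, and the stubs return random tensors in place of the actual retrieval system, depth estimator, and video diffusion model.

```python
# Hypothetical sketch of the retrieval-augmented storytelling pipeline.
# Stubs return random tensors so the control flow runs end-to-end.
from dataclasses import dataclass
from typing import List

import numpy as np

T, H, W = 16, 256, 256  # frames, height, width of one clip


@dataclass
class Shot:
    """One storyboard entry: a retrieval query plus a rendering prompt."""
    retrieval_query: str  # e.g. "a man walks on the beach"
    render_prompt: str    # e.g. "<hero> walks on the beach, cartoon style"


def retrieve_clips(query: str, top_k: int = 5) -> List[np.ndarray]:
    """Module (i), step 1: text-to-video retrieval.
    A real system would query an off-the-shelf retrieval index; here we
    return random (T, H, W, 3) arrays as stand-in candidate clips."""
    return [np.random.rand(T, H, W, 3) for _ in range(top_k)]


def extract_depth(clip: np.ndarray) -> np.ndarray:
    """Module (i), step 2: per-frame depth as the motion structure.
    A real system would run a monocular depth estimator; here we use
    mean pixel intensity as a placeholder (T, H, W) sequence."""
    return clip.mean(axis=-1)


def generate_video(depth: np.ndarray, prompt: str) -> np.ndarray:
    """Module (ii): structure- and text-conditioned synthesis.
    Stands in for a controllable video generation model; the prompt may
    contain a personalized concept token (e.g. "<hero>") so the same
    character identity appears in every synthesized clip."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.random((*depth.shape, 3))


def animate_story(shots: List[Shot]) -> List[np.ndarray]:
    """Render one clip per shot, following retrieved motion structure."""
    clips = []
    for shot in shots:
        candidates = retrieve_clips(shot.retrieval_query)
        depth = extract_depth(candidates[0])  # best/user-selected candidate
        clips.append(generate_video(depth, shot.render_prompt))
    return clips


if __name__ == "__main__":
    story = [
        Shot("a man walks on the beach", "<hero> walks on the beach, cartoon style"),
        Shot("a man opens a door", "<hero> opens a wooden door, cartoon style"),
    ]
    print(len(animate_story(story)), "clips rendered")
```

Note the design point this makes explicit: the retrieved clip contributes only its depth sequence, so scene layout and motion are reused while appearance is entirely re-rendered from the prompt, and the shared character token is what keeps identities consistent across shots.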
