Make-A-Video: Text-to-Video Generation without Text-Video Data

We propose Make-A-Video, an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model, since it does not need to learn visual and multimodal representations from scratch; (2) it does not require paired text-video data; and (3) the generated videos inherit the vastness (aesthetic diversity, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel spatiotemporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatiotemporal pipeline that generates high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models, which together enable various applications beyond T2V. In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
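
To make the space-time decomposition concrete, the following is a minimal PyTorch sketch of a pseudo-3D convolution in the spirit described above: a 2D spatial convolution (which can be taken from a pretrained T2I model) followed by a 1D temporal convolution initialized to the identity, so the stacked module initially behaves like the image model applied frame by frame. The class name, tensor layout, and initialization details are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized spatiotemporal convolution: a 2D conv over (h, w)
    followed by a 1D conv over t, approximating a full 3D conv."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Spatial 2D conv, applied independently to every frame.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Temporal 1D conv, applied independently at every spatial location.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        # Identity initialization (an assumption in this sketch): the new
        # temporal conv starts as a no-op, so training begins from the
        # behavior of the pretrained image model.
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        # Spatial pass: fold time into the batch dimension.
        y = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        y = self.spatial(y).reshape(b, t, c, h, w)
        # Temporal pass: fold space into the batch dimension.
        y = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
```

The attention layers admit the same factorization: a spatial attention pass within each frame followed by a temporal pass across frames at each spatial location, which scales far better than attending over all space-time tokens jointly.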

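The cascaded pipeline can likewise be sketched as a straight composition of stages. The stage interfaces, frame counts, and resolutions below are hypothetical placeholders chosen for illustration; only the ordering (decode, interpolate, then upsample twice) follows the description above.

```python
import torch

def generate_video(text_embedding: torch.Tensor,
                   decoder, interpolator, sr_low, sr_high) -> torch.Tensor:
    """Minimal sketch of the cascaded inference pipeline. Each stage is
    assumed to map a video tensor of shape (batch, channels, frames,
    height, width) to a larger one; all interfaces are hypothetical."""
    video = decoder(text_embedding)  # low-res, low-frame-rate clip, e.g. 16 frames at 64x64
    video = interpolator(video)      # raise the effective frame rate
    video = sr_low(video)            # first super-resolution stage
    video = sr_high(video)           # second super-resolution stage
    return video
```

Because the interpolation and super-resolution stages operate on videos rather than on text, they can be reused for applications beyond T2V, such as animating a single input image or increasing the frame rate of an existing clip.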