Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction. Existing few-shot video-language learners focus exclusively on the encoder, resulting in the absence of a video-to-text decoder to handle generative tasks. Video captioners have been pretrained on large-scale video-language datasets, but they rely heavily on finetuning and lack the ability to generate text for unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language Learner via Image and Language models, which demonstrates strong performance on few-shot video-to-text tasks without pretraining or finetuning on any video datasets. We use image-language models to translate the video content into frame captions and object, attribute, and event phrases, and compose them into a temporal-structure template. We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content. The flexibility of prompting allows the model to capture any form of text input, such as automatic speech recognition (ASR) transcripts. Our experiments demonstrate the power of language models in understanding videos on a wide variety of video-language tasks, including video captioning, video question answering, video caption retrieval, and video future event prediction. In particular, on video future event prediction, our few-shot model significantly outperforms state-of-the-art supervised models trained on large-scale video datasets.
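As a rough illustration of the pipeline described above, the sketch below composes frame-level outputs from image-language models into a temporally ordered text prompt, prepends a task instruction and a few in-context examples, and hands the result to a text-only language model. This is a minimal sketch under stated assumptions, not the authors' released implementation: the function names, the template wording, and the `llm` callable are hypothetical stand-ins.

```python
# Hypothetical sketch of a VidIL-style prompting pipeline (not the authors' code).
from typing import Callable, Dict, List

def compose_video_prompt(frames: List[Dict[str, List[str]]], asr: str = "") -> str:
    """Render per-frame captions and visual tokens into a temporal template.

    Each entry in `frames` is assumed to hold image-language-model outputs, e.g.
    {"caption": [...], "objects": [...], "attributes": [...], "events": [...]}.
    """
    # Ordering words verbalize frame order for the language model.
    order_words = ["First", "Then", "After that", "Finally"]
    lines = []
    for i, frame in enumerate(frames):
        prefix = order_words[min(i, len(order_words) - 1)]
        lines.append(f"{prefix}, {frame['caption'][0]}")
        lines.append(f"  Objects: {', '.join(frame['objects'])}")
        lines.append(f"  Attributes: {', '.join(frame['attributes'])}")
        lines.append(f"  Events: {', '.join(frame['events'])}")
    if asr:
        # Optional text input such as ASR transcripts can be appended directly.
        lines.append(f"Subtitles: {asr}")
    return "\n".join(lines)

def build_few_shot_prompt(instruction: str,
                          examples: List[Dict[str, str]],
                          query_video: str) -> str:
    """Prepend the task instruction and in-context examples to the query video."""
    parts = [instruction]
    for ex in examples:
        parts.append(f"{ex['video']}\nAnswer: {ex['target']}")
    parts.append(f"{query_video}\nAnswer:")
    return "\n\n".join(parts)

def generate(prompt: str, llm: Callable[[str], str]) -> str:
    """`llm` stands in for any text-completion model (e.g. GPT-3 behind an API)."""
    return llm(prompt)
```

The key design choice is that the language model never sees pixels: all visual evidence arrives as ordered text, so the same prompt format can serve captioning, question answering, or future event prediction by changing only the instruction and the in-context examples.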
