Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks

Foundational multimodal models pre-trained on large-scale image-text pairs, video-text pairs, or both have shown strong generalization abilities on downstream tasks. However, unlike image-text models, pretraining video-text models is not always feasible due to the difficulty of collecting large-scale, clean, and well-aligned data and the enormous computational costs involved in the pretraining phase. The pertinent question to ask is therefore: can image-text models be adapted to video tasks, and is there any benefit to using these models over pretraining directly on videos? In this work, we focus on this question by conducting a detailed study of the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting. We investigate 9 foundational image-text models on a diverse set of video tasks that includes video action recognition (video AR), video retrieval (video RT), video question answering (video QA), video multiple choice (video MC), and video captioning (video CP). Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC. However, they perform only moderately on video captioning and poorly on video QA. These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
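To make the zero-shot adaptation concrete, the sketch below shows one common way an image-text model can be applied to video action recognition without any video pretraining: sample frames from the clip, score each frame against text prompts for the candidate actions, and average the per-frame scores. This is a minimal illustration assuming a CLIP checkpoint loaded through Hugging Face `transformers`; the specific model name, prompt template, and frame-sampling choices are illustrative assumptions, not necessarily the exact protocol used in the study.

```python
# Minimal sketch: zero-shot video action recognition with an image-text model.
# Assumptions: CLIP (openai/clip-vit-base-patch32) via Hugging Face transformers,
# a simple prompt template, and uniformly sampled frames supplied by the caller.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_video(frames: list[Image.Image], class_names: list[str]) -> str:
    """Assign an action label to a video by averaging per-frame CLIP scores."""
    prompts = [f"a video of a person {name}" for name in class_names]
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # out.logits_per_image has shape (num_frames, num_classes):
    # image-text similarity for every sampled frame and candidate action.
    scores = out.logits_per_image.softmax(dim=-1).mean(dim=0)  # average over frames
    return class_names[int(scores.argmax())]
```

Frame averaging is only one pooling strategy; alternatives such as max-pooling over frames or encoding prompts with richer templates are equally simple to plug in and are the kind of design choices such a zero-shot study compares.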
