VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
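
To make the adaptation concrete, the sketch below illustrates the idea of pooling over flattened frame embeddings with JAX. It is a minimal, single-head illustration rather than the authors' implementation: the function names, shapes, and random query initializations are assumptions for exposition, and CoCa's actual poolers are multi-head attention layers with learned queries inside a larger model.

```python
# Minimal sketch (not the released VideoCoCa code): per-frame token embeddings
# from a frozen image encoder are flattened along time, then pooled by
# cross-attention from a small set of query tokens.
import jax
import jax.numpy as jnp

def attentional_pool(queries, tokens):
    """Single-head cross-attention pooling: queries attend over tokens.

    queries: (n_q, d) learned query embeddings (random here, for illustration)
    tokens:  (n_t, d) flattened frame token embeddings
    returns: (n_q, d) pooled embeddings
    """
    d = queries.shape[-1]
    scores = queries @ tokens.T / jnp.sqrt(d)   # (n_q, n_t) similarity scores
    weights = jax.nn.softmax(scores, axis=-1)   # attention over all frame tokens
    return weights @ tokens                     # (n_q, d)

# Illustrative shapes: T frames, N tokens per frame, embedding dim D.
T, N, D = 8, 16, 64
frame_tokens = jax.random.normal(jax.random.PRNGKey(0), (T, N, D))  # per-frame encoder outputs
flat_tokens = frame_tokens.reshape(T * N, D)                        # "flattened frame embeddings"

# Contrastive pooler: a single query yields one video-level embedding
# for contrastive alignment with text embeddings.
contrastive_query = jax.random.normal(jax.random.PRNGKey(1), (1, D))
video_embedding = attentional_pool(contrastive_query, flat_tokens)   # (1, D)

# Generative pooler: multiple queries yield a token sequence that the
# text decoder cross-attends to for captioning.
generative_queries = jax.random.normal(jax.random.PRNGKey(2), (256, D))
decoder_inputs = attentional_pool(generative_queries, flat_tokens)    # (256, D)
```

Because the poolers operate on a flat set of tokens regardless of how many frames produced them, the pretrained image-text weights can be applied to video inputs without adding cross-frame fusion modules.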
