Few-shot Action Recognition with Captioning Foundation Models

Transferring vision-language knowledge from pretrained multimodal foundation models to downstream tasks is a promising direction. However, most current few-shot action recognition methods remain limited to single-modality visual input because annotating additional textual descriptions is costly. In this paper, we develop an effective plug-and-play framework, CapFSAR, that exploits the knowledge of multimodal foundation models without manually annotated text. Specifically, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate captions for the input videos. We then apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, we design a Transformer-based visual-text aggregation module that incorporates cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR benefits from the powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
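To make the described pipeline concrete, below is a minimal PyTorch sketch of the visual-text aggregation and few-shot matching steps. All names (VisualTextAggregator, few_shot_logits), feature dimensions, and the prototype-based matching head are illustrative assumptions rather than the paper's exact architecture; random tensors stand in for BLIP's per-frame visual features and the text embeddings of its generated captions.

```python
# Hypothetical sketch of a CapFSAR-style pipeline: fuse frame features with
# caption embeddings via cross-attention, then match queries to class
# prototypes. Dimensions and module design are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAggregator(nn.Module):
    """Fuse per-frame visual features with caption token embeddings using
    cross-attention, then refine temporally with a self-attention layer."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=1,
        )

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, D) frame features; txt: (B, L, D) caption embeddings
        fused, _ = self.cross_attn(query=vis, key=txt, value=txt)
        return self.temporal(vis + fused)  # (B, T, D)

def few_shot_logits(support, support_labels, query, n_way: int):
    """Prototype matching head (one simple choice of matcher): average the
    fused features over time, build one prototype per class, and score
    queries by cosine similarity to each prototype."""
    s = F.normalize(support.mean(dim=1), dim=-1)  # (N*K, D)
    q = F.normalize(query.mean(dim=1), dim=-1)    # (Q, D)
    protos = torch.stack([s[support_labels == c].mean(0) for c in range(n_way)])
    return q @ F.normalize(protos, dim=-1).t()    # (Q, N) similarity logits

# Toy 5-way 1-shot usage: random tensors stand in for BLIP frame features
# (8 frames, D = 512) and caption embeddings (16 tokens per video).
agg = VisualTextAggregator()
sup = agg(torch.randn(5, 8, 512), torch.randn(5, 16, 512))
qry = agg(torch.randn(3, 8, 512), torch.randn(3, 16, 512))
print(few_shot_logits(sup, torch.arange(5), qry, n_way=5).shape)  # (3, 5)
```

The cross-attention step lets caption semantics modulate the visual stream before temporal reasoning; the cosine-prototype matcher is one standard few-shot classifier and stands in for whatever matching scheme the full method uses.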

[1] Jingren Zhou, et al. VideoComposer: Compositional Video Synthesis with Motion Controllability, 2023, NeurIPS.

[2] Shiwei Zhang, et al. Cross-domain few-shot action recognition with unlabeled videos, 2023, Comput. Vis. Image Underst.

[3] Shiwei Zhang, et al. MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Shiwei Zhang, et al. CLIP-guided Prototype Modulating for Few-shot Action Recognition, 2023, International Journal of Computer Vision.

[5] Shiwei Zhang, et al. HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition, 2023, arXiv.

[6] Mike Zheng Shou, et al. Position-Guided Text Prompt for Vision-Language Pre-Training, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Qun Liu, et al. LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling, 2022, EMNLP.

[8] S. Savarese, et al. Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training, 2022, EMNLP.

[9] Yu-Gang Jiang, et al. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, 2022, NeurIPS.

[10] Samuel Albanie, et al. RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, 2022, NeurIPS.

[11] Quoc-Huy Tran, et al. Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments, 2022, ECCV.

[12] Percy Liang, et al. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, 2022, arXiv.

[13] Yifei Huang, et al. Compound Prototype Matching for Few-shot Action Recognition, 2022, ECCV.

[14] Tianzhu Zhang, et al. Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.

[16] Derek Hoiem, et al. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, 2022, NeurIPS.

[17] Zirui Wang, et al. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Trans. Mach. Learn. Res.

[18] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[19] Shiwei Zhang, et al. Hybrid Relation Guided Set Matching for Few-shot Action Recognition, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Jingren Zhou, et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, 2022, ICML.

[21] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.

[22] F. Khan, et al. Spatio-temporal Relation Modeling for Few-shot Action Recognition, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Dongdong Chen, et al. CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Weidi Xie, et al. Prompting Visual-Language Models for Efficient Video Understanding, 2021, ECCV.

[25] Lorenzo Torresani, et al. Label Hallucination for Few-Shot Classification, 2021, AAAI.

[26] Xiaowei Hu, et al. Scaling Up Vision-Language Pretraining for Image Captioning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Li Dong, et al. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, 2021, NeurIPS.

[29] Chen Change Loy, et al. Learning to Prompt for Vision-Language Models, 2021, International Journal of Computer Vision.

[30] Massimiliano Pontil, et al. The Role of Global Labels in Few-Shot Classification and How to Infer Them, 2021, NeurIPS.

[31] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.

[32] John See, et al. TA2N: Two-Stage Action Alignment Network for Few-Shot Action Recognition, 2021, AAAI.

[33] Yejin Choi, et al. VinVL: Revisiting Visual Representations in Vision-Language Models, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Yongjian Wu, et al. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Songyang Zhang, et al. Learning Implicit Temporal Alignment for Few-shot Video Classification, 2021, IJCAI.

[36] Yin Cui, et al. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation, 2021, ICLR.

[37] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[38] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.

[39] Majid Mirmehdi, et al. Temporal-Relational CrossTransformers for Few-Shot Action Recognition, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Lingfeng Wang, et al. Few-Shot Learning via Feature Hallucination with Variational Inference, 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[41] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[42] Yi Yang, et al. Label Independent Memory for Semi-Supervised Few-Shot Video Classification, 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43] Jianfeng Gao, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.

[44] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[45] Kai Li, et al. Adversarial Feature Hallucination Networks for Few-Shot Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Hongdong Li, et al. Few-Shot Action Recognition with Permutation-Invariant Attention, 2020, ECCV.

[47] F. Hutter, et al. Meta-Learning of Neural Architectures for Few-Shot Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Tao Xiang, et al. Few-Shot Learning With Global Class Representations, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49] Ioannis Patras, et al. TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition, 2019, BMVC.

[50] Juan Carlos Niebles, et al. Few-Shot Video Classification via Temporal Alignment, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Sung Whan Yoon, et al. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning, 2019, ICML.

[52] Yu-Wing Tai, et al. Memory-Attended Recurrent Network for Video Captioning, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[54] Jing Zhang, et al. Few-Shot Learning via Saliency-Guided Hallucination of Samples, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Fei Sha, et al. Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions, 2018, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57] Yi Yang, et al. Compound Memory Networks for Few-Shot Video Classification, 2018, ECCV.

[58] Mubarak Shah, et al. Task Agnostic Meta-Learning for Few-Shot Learning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Wei Liu, et al. Reconstruction Network for Video Captioning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Bolei Zhou, et al. Temporal Relational Reasoning in Videos, 2017, ECCV.

[61] Tao Xiang, et al. Learning to Compare: Relation Network for Few-Shot Learning, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62] Hang Li, et al. Meta-SGD: Learning to Learn Quickly for Few Shot Learning, 2017, arXiv.

[63] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64] Susanne Westphal, et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65] Wei Shen, et al. Few-Shot Image Recognition by Predicting Parameters from Activations, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67] Richard S. Zemel, et al. Prototypical Networks for Few-shot Learning, 2017, NIPS.

[68] Sergey Levine, et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017, ICML.

[69] Tao Mei, et al. Video Captioning with Transferred Semantic Attributes, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Hugo Larochelle, et al. Optimization as a Model for Few-Shot Learning, 2016, ICLR.

[71] Abhishek Das, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[72] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[73] Oriol Vinyals, et al. Matching Networks for One Shot Learning, 2016, NIPS.

[74] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75] Yoshua Bengio, et al. On Using Monolingual Corpora in Neural Machine Translation, 2015, arXiv.

[76] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[77] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[78] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80] Tal Hassner, et al. One Shot Similarity Metric Learning for Action Recognition, 2011, SIMBAD.

[81] Prateek Jain, et al. Far-sighted active learning on a budget for image and video recognition, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[82] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[83] Meinard Müller, et al. Information retrieval for music and motion, 2007.

[84] Pietro Perona, et al. One-shot learning of object categories, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[85] Qin Jin, et al. Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning, 2022, ECCV.

[86] Wai Keen Vong, et al. Few-shot image classification by generating natural language rules, 2022.

[87] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[88] Ramprasaath R. Selvaraju, et al. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization, 2016.

[89] Alan Bundy, et al. Dynamic Time Warping, 1984.