Few-shot Action Recognition with Captioning Foundation Models

Transferring vision-language knowledge from pretrained multimodal foundation models to downstream tasks is a promising direction. However, most current few-shot action recognition methods remain limited to single-modality visual input because annotating additional textual descriptions is costly. In this paper, we develop an effective plug-and-play framework, CapFSAR, that exploits the knowledge of multimodal foundation models without manually annotated text. Specifically, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate captions for the input videos. We then apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, we design a Transformer-based visual-text aggregation module that incorporates cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR benefits from the powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
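To make the described pipeline concrete, below is a minimal PyTorch sketch of the visual-text aggregation and few-shot matching steps. All names (VisualTextAggregator, few_shot_logits), feature dimensions, and the prototype-based matching head are illustrative assumptions rather than the paper's exact architecture; random tensors stand in for BLIP's per-frame visual features and the text embeddings of its generated captions.

```python
# Hypothetical sketch of a CapFSAR-style pipeline: fuse frame features with
# caption embeddings via cross-attention, then match queries to class
# prototypes. Dimensions and module design are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAggregator(nn.Module):
    """Fuse per-frame visual features with caption token embeddings using
    cross-attention, then refine temporally with a self-attention layer."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True),
            num_layers=1,
        )

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis: (B, T, D) frame features; txt: (B, L, D) caption embeddings
        fused, _ = self.cross_attn(query=vis, key=txt, value=txt)
        return self.temporal(vis + fused)  # (B, T, D)

def few_shot_logits(support, support_labels, query, n_way: int):
    """Prototype matching head (one simple choice of matcher): average the
    fused features over time, build one prototype per class, and score
    queries by cosine similarity to each prototype."""
    s = F.normalize(support.mean(dim=1), dim=-1)  # (N*K, D)
    q = F.normalize(query.mean(dim=1), dim=-1)    # (Q, D)
    protos = torch.stack([s[support_labels == c].mean(0) for c in range(n_way)])
    return q @ F.normalize(protos, dim=-1).t()    # (Q, N) similarity logits

# Toy 5-way 1-shot usage: random tensors stand in for BLIP frame features
# (8 frames, D = 512) and caption embeddings (16 tokens per video).
agg = VisualTextAggregator()
sup = agg(torch.randn(5, 8, 512), torch.randn(5, 16, 512))
qry = agg(torch.randn(3, 8, 512), torch.randn(3, 16, 512))
print(few_shot_logits(sup, torch.arange(5), qry, n_way=5).shape)  # (3, 5)
```

The cross-attention step lets caption semantics modulate the visual stream before temporal reasoning; the cosine-prototype matcher is one standard few-shot classifier and stands in for whatever matching scheme the full method uses.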

[1] Jingren Zhou, et al. VideoComposer: Compositional Video Synthesis with Motion Controllability, 2023, NeurIPS.

[2] Shiwei Zhang, et al. Cross-domain few-shot action recognition with unlabeled videos, 2023, Comput. Vis. Image Underst.

[3] Shiwei Zhang, et al. MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Shiwei Zhang, et al. CLIP-guided Prototype Modulating for Few-shot Action Recognition, 2023, International Journal of Computer Vision.

[5] Shiwei Zhang, et al. HyRSM++: Hybrid Relation Guided Temporal Set Matching for Few-shot Action Recognition, 2023, arXiv.

[6] Mike Zheng Shou, et al. Position-Guided Text Prompt for Vision-Language Pre-Training, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Qun Liu, et al. LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling, 2022, EMNLP.

[8] S. Savarese, et al. Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training, 2022, EMNLP.

[9] Yu-Gang Jiang, et al. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, 2022, NeurIPS.

[10] Samuel Albanie, et al. RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, 2022, NeurIPS.

[11] Quoc-Huy Tran, et al. Inductive and Transductive Few-Shot Video Classification via Appearance and Temporal Alignments, 2022, ECCV.

[12] Percy Liang, et al. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, 2022, arXiv.

[13] Yifei Huang, et al. Compound Prototype Matching for Few-shot Action Recognition, 2022, ECCV.

[14] Tianzhu Zhang, et al. Motion-modulated Temporal Fragment Alignment Network For Few-Shot Action Recognition, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.

[16] Derek Hoiem, et al. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners, 2022, NeurIPS.

[17] Zirui Wang, et al. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Trans. Mach. Learn. Res.

[18] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.

[19] Shiwei Zhang, et al. Hybrid Relation Guided Set Matching for Few-shot Action Recognition, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Jingren Zhou, et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, 2022, ICML.

[21] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.

[22] F. Khan, et al. Spatio-temporal Relation Modeling for Few-shot Action Recognition, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[23] Dongdong Chen, et al. CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24] Weidi Xie, et al. Prompting Visual-Language Models for Efficient Video Understanding, 2021, ECCV.

[25] Lorenzo Torresani, et al. Label Hallucination for Few-Shot Classification, 2021, AAAI.

[26] Xiaowei Hu, et al. Scaling Up Vision-Language Pretraining for Image Captioning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28] Li Dong, et al. VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts, 2021, NeurIPS.

[29] Chen Change Loy, et al. Learning to Prompt for Vision-Language Models, 2021, International Journal of Computer Vision.

[30] Massimiliano Pontil, et al. The Role of Global Labels in Few-Shot Classification and How to Infer Them, 2021, NeurIPS.

[31] Junnan Li, et al. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, 2021, NeurIPS.

[32] John See, et al. TA2N: Two-Stage Action Alignment Network for Few-Shot Action Recognition, 2021, AAAI.

[33] Yejin Choi, et al. VinVL: Revisiting Visual Representations in Vision-Language Models, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Yongjian Wu, et al. RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Songyang Zhang, et al. Learning Implicit Temporal Alignment for Few-shot Video Classification, 2021, IJCAI.

[36] Yin Cui, et al. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation, 2021, ICLR.

[37] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[38] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.

[39] Majid Mirmehdi, et al. Temporal-Relational CrossTransformers for Few-Shot Action Recognition, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Lingfeng Wang, et al. Few-Shot Learning via Feature Hallucination with Variational Inference, 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[41] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.

[42] Yi Yang, et al. Label Independent Memory for Semi-Supervised Few-Shot Video Classification, 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[43] Jianfeng Gao, et al. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, 2020, ICLR.

[44] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[45] Kai Li, et al. Adversarial Feature Hallucination Networks for Few-Shot Learning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Hongdong Li, et al. Few-Shot Action Recognition with Permutation-Invariant Attention, 2020, ECCV.

[47] F. Hutter, et al. Meta-Learning of Neural Architectures for Few-Shot Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[48] Tao Xiang, et al. Few-Shot Learning With Global Class Representations, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[49] Ioannis Patras, et al. TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition, 2019, BMVC.

[50] Juan Carlos Niebles, et al. Few-Shot Video Classification via Temporal Alignment, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Sung Whan Yoon, et al. TapNet: Neural Network Augmented with Task-Adaptive Projection for Few-Shot Learning, 2019, ICML.

[52] Yu-Wing Tai, et al. Memory-Attended Recurrent Network for Video Captioning, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[54] Jing Zhang, et al. Few-Shot Learning via Saliency-Guided Hallucination of Samples, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Fei Sha, et al. Few-Shot Learning via Embedding Adaptation With Set-to-Set Functions, 2018, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56] Chuang Gan, et al. TSM: Temporal Shift Module for Efficient Video Understanding, 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[57] Yi Yang, et al. Compound Memory Networks for Few-Shot Video Classification, 2018, ECCV.

[58] Mubarak Shah, et al. Task Agnostic Meta-Learning for Few-Shot Learning, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Wei Liu, et al. Reconstruction Network for Video Captioning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Bolei Zhou, et al. Temporal Relational Reasoning in Videos, 2017, ECCV.

[61] Tao Xiang, et al. Learning to Compare: Relation Network for Few-Shot Learning, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[62] Hang Li, et al. Meta-SGD: Learning to Learn Quickly for Few Shot Learning, 2017, arXiv.

[63] Lei Zhang, et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64] Susanne Westphal, et al. The “Something Something” Video Database for Learning and Evaluating Visual Common Sense, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65] Wei Shen, et al. Few-Shot Image Recognition by Predicting Parameters from Activations, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[66] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[67] Richard S. Zemel, et al. Prototypical Networks for Few-shot Learning, 2017, NIPS.

[68] Sergey Levine, et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017, ICML.

[69] Tao Mei, et al. Video Captioning with Transferred Semantic Attributes, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Hugo Larochelle, et al. Optimization as a Model for Few-Shot Learning, 2016, ICLR.

[71] Abhishek Das, et al. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization, 2016, 2017 IEEE International Conference on Computer Vision (ICCV).

[72] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.

[73] Oriol Vinyals, et al. Matching Networks for One Shot Learning, 2016, NIPS.

[74] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[75] Yoshua Bengio, et al. On Using Monolingual Corpora in Neural Machine Translation, 2015, arXiv.

[76] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[77] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[78] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[79] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[80] Tal Hassner, et al. One Shot Similarity Metric Learning for Action Recognition, 2011, SIMBAD.

[81] Prateek Jain, et al. Far-sighted active learning on a budget for image and video recognition, 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[82] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[83] Meinard Müller, et al. Information retrieval for music and motion, 2007.

[84] Pietro Perona, et al. One-shot learning of object categories, 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[85] Qin Jin, et al. Few-Shot Action Recognition with Hierarchical Matching and Contrastive Learning, 2022, ECCV.

[86] Wai Keen Vong, et al. Few-shot image classification by generating natural language rules, 2022.

[87] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[88] Ramprasaath R. Selvaraju, et al. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization, 2016.

[89] Alan Bundy, et al. Dynamic Time Warping, 1984.