Paper information - Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation

Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip
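
The abstract describes multi-modal prompt construction only at a high level, so the following is a toy sketch of the general idea rather than the MUPPET implementation. It assumes each support video arrives as a list of per-frame feature vectors, that the adapter is a residual bottleneck layer, and that the resulting visual tokens are simply placed in front of the class-name token embeddings. Every function name, shape, and weight below (`mean_pool`, `adapter`, `build_prompt`, `to_token`, the dimensions) is hypothetical and chosen for illustration.

```python
import random


def mean_pool(frames):
    """Average per-frame feature vectors into one clip-level vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def linear(x, weight):
    """Apply a bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def adapter(x, down, up):
    """Assumed residual bottleneck adapter: x + up(relu(down(x)))."""
    hidden = [max(0.0, h) for h in linear(x, down)]
    return [xi + ui for xi, ui in zip(x, linear(hidden, up))]


def build_prompt(support_videos, class_name_tokens, down, up, to_token):
    """Return one token sequence: visual tokens followed by class-name tokens."""
    visual_tokens = []
    for frames in support_videos:
        clip = adapter(mean_pool(frames), down, up)   # adapt the clip feature
        visual_tokens.append(linear(clip, to_token))  # project into token space
    return visual_tokens + class_name_tokens


def random_matrix(rng, rows, cols):
    """Small random weights standing in for meta-learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feat_dim, bottleneck, token_dim = 8, 2, 4  # hypothetical sizes

down = random_matrix(rng, bottleneck, feat_dim)
up = random_matrix(rng, feat_dim, bottleneck)
to_token = random_matrix(rng, token_dim, feat_dim)

# Two support videos (2-shot), five frames each, plus three class-name tokens.
support_videos = [random_matrix(rng, 5, feat_dim) for _ in range(2)]
class_name_tokens = random_matrix(rng, 3, token_dim)

prompt = build_prompt(support_videos, class_name_tokens, down, up, to_token)
print(len(prompt), len(prompt[0]))
```

In this sketch the prompt holds one visual token per support video followed by the class-name tokens, so a text encoder that consumes token embeddings could process both modalities as a single sequence. How the actual method pools video features, structures its adapter, or orders the tokens is not stated in the abstract.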
