Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between the visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages Video-to-Text knowledge to generate textual auxiliary attributes that complement video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representations. Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet, and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 benchmark using the released CLIP model. The code is available at https://github.com/whwu95/BIKE.
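To make the Text-to-Video direction more concrete, the following is a minimal sketch of how a parameter-free temporal saliency weighting could be built on top of per-frame CLIP embeddings: each frame is scored by its similarity to the category text, and a softmax over time turns those scores into pooling weights. The function name, tensor shapes, and softmax temperature are illustrative assumptions for exposition, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_saliency_pooling(frame_feats: torch.Tensor,
                              text_feat: torch.Tensor,
                              temperature: float = 0.01) -> torch.Tensor:
    """Parameter-free temporal pooling guided by text-frame similarity.

    frame_feats: (T, D) per-frame visual embeddings (e.g., from a CLIP image encoder).
    text_feat:   (D,)   embedding of the category name or prompt (CLIP text encoder).
    Returns a (D,) video embedding in which frames more similar to the text
    receive larger pooling weights.
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)

    # Cosine similarity of each frame to the category text: shape (T,)
    sims = frame_feats @ text_feat

    # Softmax over time converts similarities into saliency weights;
    # no learnable parameters are involved.
    weights = F.softmax(sims / temperature, dim=0)

    # Saliency-weighted temporal aggregation: shape (D,)
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)


# Usage with dummy tensors standing in for CLIP features (8 frames, 512-d):
video_emb = temporal_saliency_pooling(torch.randn(8, 512), torch.randn(512))
```

Because the weighting uses only cosine similarity and a softmax, it introduces no additional trainable parameters, which is consistent with the "parameter-free" temporal saliency described in the abstract.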
