Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models