Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring
Thomas H. Li | Jiashi Feng | Xing Wu | Ruyang Liu | Jingjia Huang | Ge Li