VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu
[1] Yi Wang, et al. InternVideo: General Video Foundation Models via Generative and Discriminative Learning, 2022, arXiv.
[2] Brais Martínez, et al. REST: REtrieve & Self-Train for generative action recognition, 2022, arXiv.
[3] Yu-Gang Jiang, et al. OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, 2022, NeurIPS.
[4] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model, 2022, arXiv.
[5] Jianlong Fu, et al. CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, 2022, arXiv.
[6] Li Dong, et al. Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks, 2022, arXiv.
[7] Gerard de Melo, et al. Frozen CLIP Models are Efficient Video Learners, 2022, ECCV.
[8] Haibin Ling, et al. Expanding Language-Image Pretrained Models for General Video Recognition, 2022, ECCV.
[9] Serge J. Belongie, et al. Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models, 2022, arXiv.
[10] Ngan T. H. Le, et al. VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning, 2022, IEEE International Conference on Image Processing (ICIP).
[11] C. Schmid, et al. Zero-Shot Video Question Answering via Frozen Bidirectional Language Models, 2022, NeurIPS.
[12] Tamara L. Berg, et al. Revealing Single Frame Bias for Video-and-Language Learning, 2022, ACL.
[13] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, Trans. Mach. Learn. Res.
[14] Andrew Zisserman, et al. A CLIP-Hitchhiker's Guide to Long Video Retrieval, 2022, arXiv.
[15] Zirui Wang, et al. CoCa: Contrastive Captioners are Image-Text Foundation Models, 2022, Trans. Mach. Learn. Res.
[16] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[17] C. Schmid, et al. Learning Audio-Video Modalities from Image Captions, 2022, ECCV.
[18] Adrian S. Wong, et al. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, 2022, ICLR.
[19] Lisa Anne Hendricks, et al. Training Compute-Optimal Large Language Models, 2022, arXiv.
[20] Fabian Caba Heilbron, et al. FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks, 2022, BMVC.
[21] Mike Zheng Shou, et al. All in One: Exploring Unified Video-Language Pre-Training, 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Yali Wang, et al. UniFormer: Unifying Convolution and Self-attention for Visual Recognition, 2022, arXiv.
[23] C. Schmid, et al. End-to-end Generative Pretraining for Multimodal Video Captioning, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] C. Schmid, et al. Multiview Transformers for Video Recognition, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Weidi Xie, et al. Prompting Visual-Language Models for Efficient Video Understanding, 2021, ECCV.
[26] Daniel Keysers, et al. LiT: Zero-Shot Transfer with Locked-image text Tuning, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[27] Adams Wei Yu, et al. SimVLM: Simple Visual Language Model Pretraining with Weak Supervision, 2021, ICLR.
[28] Stephen Lin, et al. Video Swin Transformer, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Nan Duan, et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, 2021, Neurocomputing.
[30] Tsu-Jui Fu, et al. VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling, 2021, arXiv.
[31] Lu Yuan, et al. Florence: A New Foundation Model for Computer Vision, 2021, arXiv.
[32] Dmytro Okhonko, et al. VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding, 2021, EMNLP.
[33] Mengmeng Wang, et al. ActionCLIP: A New Paradigm for Video Action Recognition, 2021, arXiv.
[34] Yonatan Bisk, et al. TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[35] Michael S. Bernstein, et al. On the Opportunities and Risks of Foundation Models, 2021, arXiv.
[36] Shizhe Chen, et al. Elaborative Rehearsal for Zero-shot Action Recognition, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[37] Pengfei Xiong, et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, 2021, arXiv.
[38] Florian Metze, et al. VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding, 2021, Findings of ACL.
[39] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[40] Andrew Zisserman, et al. Perceiver: General Perception with Iterative Attention, 2021, ICML.
[41] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[42] Jesús Andrés Portillo-Quintero, et al. A Straightforward Framework For Video Retrieval Using CLIP, 2021, MCPR.
[43] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.
[44] Heng Wang, et al. Is Space-Time Attention All You Need for Video Understanding?, 2021, ICML.
[45] C. Schmid, et al. Just Ask: Learning to Answer Questions from Millions of Narrated Videos, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[46] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[47] Sivaji Bandyopadhyay, et al. NITS-VC System for VATEX Video Captioning Challenge 2020, 2020, arXiv.
[48] Mohit Bansal, et al. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, 2020, ACL.
[49] Shizhe Chen, et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Bing Li, et al. Object Relational Graph With Teacher-Recommended Learning for Video Captioning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[51] Xilin Chen, et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, 2020, arXiv.
[52] Quoc V. Le, et al. RandAugment: Practical automated data augmentation with a reduced search space, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[53] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[54] Jun Yu, et al. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering, 2019, AAAI.
[55] Xin Wang, et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[56] Yee Whye Teh, et al. Set Transformer, 2018, ICML.
[57] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[58] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[59] Noam Shazeer, et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, 2018, ICML.
[60] Chen Sun, et al. Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification, 2017, ECCV.
[61] Abhinav Gupta, et al. Non-local Neural Networks, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[62] Hongyi Zhang, et al. mixup: Beyond Empirical Risk Minimization, 2017, ICLR.
[63] Chenliang Xu, et al. Towards Automatic Learning of Procedures From Web Instructional Videos, 2017, AAAI.
[64] Yueting Zhuang, et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion, 2017, ACM Multimedia.
[65] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[66] Fabio Viola, et al. The Kinetics Human Action Video Dataset, 2017, arXiv.
[67] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[68] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, 2016, ECCV.
[69] Kilian Q. Weinberger, et al. Deep Networks with Stochastic Depth, 2016, ECCV.
[70] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[71] Sergey Ioffe, et al. Rethinking the Inception Architecture for Computer Vision, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[72] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[73] Mubarak Shah, et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012, arXiv.
[74] Thomas Serre, et al. HMDB: A large video database for human motion recognition, 2011, 2011 International Conference on Computer Vision (ICCV).
[75] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL.
[76] Salim Roukos, et al. BLEU: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.