CoVR: Learning Composed Video Retrieval from Web Video Captions