VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval
[1] Zhijun Fang,et al. An effective CNN and Transformer complementary network for medical image segmentation , 2022, Pattern Recognit.
[2] G. Cosma,et al. Improving visual-semantic embeddings by learning semantically-enhanced hard negatives for cross-modal information retrieval , 2022, Pattern Recognit.
[3] Y. Fu,et al. Image-Text Embedding Learning via Visual and Textual Semantic Reasoning , 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[4] Hwanjun Song,et al. e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce , 2022, CIKM.
[5] Zhangyang Wang,et al. EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] A. Bimbo,et al. Effective conditioned and composed image retrieval combining CLIP-based features , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yizhao Gao,et al. COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Errui Ding,et al. ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[9] Xiaodan Liang,et al. Atom correlation based graph propagation for scene graph generation , 2022, Pattern Recognit.
[10] Lu Yuan,et al. RegionCLIP: Region-based Language-Image Pretraining , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Lu Yuan,et al. HairCLIP: Design Your Hair by Text and Reference Image , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Tongliang Liu,et al. CRIS: CLIP-Driven Referring Image Segmentation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Jong-Chul Ye,et al. DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Changsheng Xu,et al. DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering , 2021, IEEE Transactions on Multimedia.
[15] Nan Duan,et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval , 2021, Neurocomputing.
[16] Zi Huang,et al. Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval , 2021, IEEE Transactions on Multimedia.
[17] Shengsheng Qian,et al. Global Relation-Aware Attention Network for Image-Text Retrieval , 2021, ICMR.
[18] Hui Fang,et al. On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval , 2021, J. Imaging.
[19] Yan Peng,et al. Dual-stream Network for Visual Recognition , 2021, NeurIPS.
[20] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[21] Jesús Andrés Portillo-Quintero,et al. A Straightforward Framework For Video Retrieval Using CLIP , 2021, MCPR.
[22] Jonghun Park,et al. Image-to-Image Retrieval by Learning Similarity between Scene Graphs , 2020, AAAI.
[23] Yuning Jiang,et al. Learning the Best Pooling Strategy for Visual Semantic Embedding , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[24] Hao Tian,et al. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph , 2020, AAAI.
[25] Hao Yang,et al. PFAN++: Bi-Directional Image-Text Retrieval With Position Focused Attention Network , 2021, IEEE Transactions on Multimedia.
[26] Weifeng Zhang,et al. Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering , 2020, Pattern Recognit.
[27] Yu Cheng,et al. Large-Scale Adversarial Training for Vision-and-Language Representation Learning , 2020, NeurIPS.
[28] Qi Zhang,et al. Context-Aware Attention Network for Image-Text Retrieval , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Yongdong Zhang,et al. Multi-Modality Cross Attention Network for Image and Sentence Matching , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Yu Cheng,et al. UNITER: UNiversal Image-TExt Representation Learning , 2019, ECCV.
[31] Nan Duan,et al. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training , 2019, AAAI.
[32] Yun Fu,et al. Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[33] Dezhong Peng,et al. Deep Supervised Cross-Modal Retrieval , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Ji Zhang,et al. Graphical Contrastive Losses for Scene Graph Parsing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.
[36] Xi Chen,et al. Stacked Cross Attention for Image-Text Matching , 2018, ECCV.
[37] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[38] David J. Fleet,et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.
[39] Razvan Pascanu,et al. A simple neural network module for relational reasoning , 2017, NIPS.
[40] Li Fei-Fei,et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Max Welling,et al. Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.
[42] Fei-Fei Li,et al. Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[43] Michael S. Bernstein,et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.
[44] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[46] Yoshua Bengio,et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.
[47] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.
[48] Peter Young,et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.
[49] F. Scarselli,et al. A new model for learning in graph domains , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005.
[50] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.