VITR: Augmenting Vision Transformers with Relation-Focused Learning for Cross-Modal Information Retrieval

Relation-focused cross-modal information retrieval aims to retrieve information based on the relations expressed in user queries, and it is particularly important in information retrieval applications and next-generation search engines. Although pre-trained networks such as Contrastive Language-Image Pre-training (CLIP) have achieved state-of-the-art performance on cross-modal learning tasks, the Vision Transformer (ViT) used in these networks is limited in its ability to focus on relations between image regions. Specifically, ViT is trained to match images with relevant descriptions at the global level, without considering the alignment between image regions and the descriptions. This paper introduces VITR, a novel network that enhances ViT by extracting and reasoning about image region relations with a Local encoder. VITR comprises two main components: (1) extending the capabilities of ViT-based cross-modal networks by extracting and reasoning about region relations in images; and (2) aggregating the reasoned results with global knowledge to predict the similarity scores between images and descriptions. Experiments were carried out by applying the proposed network to relation-focused cross-modal information retrieval tasks on the Flickr30K, RefCOCOg, and CLEVR datasets. The results revealed that the proposed VITR network outperformed various state-of-the-art networks, including CLIP, VSE$\infty$, and VSRN++, on both image-to-text and text-to-image cross-modal information retrieval tasks.
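To make the described two-component design concrete, the following is a minimal PyTorch-style sketch of how local region-relation reasoning might be aggregated with a global ViT/CLIP-style similarity score. All module and variable names (RelationReasoner, VITRStyleScorer, the learned fusion weight alpha) are illustrative assumptions based only on this abstract, not the paper's exact implementation.

# Minimal sketch of a VITR-style scorer: a global image-description similarity
# is fused with a similarity computed from reasoned region relations.
# Names and the fusion scheme are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoner(nn.Module):
    """Reasons over all ordered pairs of region features with a small MLP and pools them."""
    def __init__(self, dim: int):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, n_regions, dim)
        b, n, d = regions.shape
        a = regions.unsqueeze(2).expand(b, n, n, d)   # region i, broadcast over j
        c = regions.unsqueeze(1).expand(b, n, n, d)   # region j, broadcast over i
        pairs = torch.cat([a, c], dim=-1)             # (b, n, n, 2*dim): all pairs (i, j)
        relations = self.pair_mlp(pairs)              # (b, n, n, dim): pairwise relation features
        return relations.mean(dim=(1, 2))             # (b, dim): pooled relation embedding

class VITRStyleScorer(nn.Module):
    """Aggregates a global similarity with a local relation-based similarity."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.reasoner = RelationReasoner(dim)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned fusion weight (assumed)

    def forward(self, global_img, global_txt, region_feats, local_txt):
        # Global score: cosine similarity of ViT/CLIP-style global embeddings.
        s_global = F.cosine_similarity(global_img, global_txt, dim=-1)
        # Local score: similarity between pooled region relations and a
        # description embedding produced by a local text encoder.
        rel = self.reasoner(region_feats)
        s_local = F.cosine_similarity(rel, local_txt, dim=-1)
        # Aggregate the two scores into the final image-description similarity.
        w = torch.sigmoid(self.alpha)
        return w * s_global + (1.0 - w) * s_local

In this sketch, the sigmoid-gated weight simply balances the global and local scores; the actual aggregation used by VITR may differ.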
