CLIP-RR: Improved CLIP Network for Relation-Focused Cross-Modal Information Retrieval

Relation-focused cross-modal information retrieval aims to retrieve information based on the relations expressed in user queries, and it is particularly important for information retrieval applications and next-generation search engines. CLIP (Contrastive Language-Image Pre-training) has achieved state-of-the-art performance in cross-modal learning tasks owing to its efficient learning of visual concepts from natural language supervision. However, CLIP learns visual representations from natural language at a global level, without the capability to focus on image-object relations. This paper proposes CLIP-RR, a novel CLIP-based network for Relation Reasoning that tackles relation-focused cross-modal information retrieval. The proposed network utilises CLIP to leverage its pre-trained knowledge, and it comprises two main parts: (1) it extends the capabilities of CLIP to extract and reason with object relations in images; and (2) it aggregates the reasoned results to predict similarity scores between images and descriptions. Experiments were carried out by applying the proposed network to relation-focused cross-modal information retrieval tasks on the RefCOCOg, CLEVR, and Flickr30K datasets. The results revealed that the proposed network outperformed various state-of-the-art networks, including CLIP, VSE∞, and VSRN++, on both image-to-text and text-to-image cross-modal information retrieval tasks.
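The two-part design described above (relation reasoning over image objects, followed by aggregation into an image-description similarity score) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the propagation scheme, the mean-pooling aggregation, and all function names are hypothetical stand-ins, not the paper's actual modules, which build on CLIP's pre-trained features and learned parameters.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def relation_reasoning(obj_feats, steps=2):
    """A simple, generic form of graph-based relation reasoning:
    build a soft affinity graph over object features and propagate
    information between related objects. Illustrative only; not the
    reasoning module proposed in the paper."""
    h = obj_feats
    for _ in range(steps):
        # Pairwise affinities between objects, row-normalised via softmax.
        logits = h @ h.T
        logits = logits - logits.max(axis=1, keepdims=True)
        w = np.exp(logits)
        w = w / w.sum(axis=1, keepdims=True)
        # Each object aggregates the features of related objects.
        h = w @ h
    return h

def similarity_score(obj_feats, text_feat):
    """Aggregate the reasoned object features (mean pooling here, as a
    placeholder) and score against a text embedding with cosine similarity."""
    reasoned = relation_reasoning(obj_feats)
    img_feat = l2_normalize(reasoned.mean(axis=0))
    txt_feat = l2_normalize(text_feat)
    return float(img_feat @ txt_feat)

# Toy example: 4 object embeddings and one text embedding, dimension 8.
rng = np.random.default_rng(0)
objs = rng.normal(size=(4, 8))
text = rng.normal(size=(8,))
score = similarity_score(objs, text)
print(score)
```

Because both aggregated vectors are L2-normalised, the resulting score is a cosine similarity bounded in [-1, 1], which is the form typically ranked in image-to-text and text-to-image retrieval.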
