Multilateral Semantic Relations Modeling for Image Text Retrieval

Image-text retrieval is a fundamental task that bridges vision and language by exploiting various strategies for fine-grained alignment between regions and words. The task remains difficult mainly because of one-to-many correspondence, in which an arbitrary query can match a set of instances from the other modality. While existing solutions to this problem, including multi-point mapping, probabilistic distributions, and geometric embeddings, have made promising progress, one-to-many correspondence is still under-explored. In this work, we develop Multilateral Semantic Relations Modeling (MSRM) for image-text retrieval, which captures the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped to a probabilistic embedding to learn its true semantic distribution based on the Mahalanobis distance. Then each candidate instance in a mini-batch is regarded as a hypergraph node represented by its mean semantics, while a Gaussian query is modeled as a hyperedge to capture the semantic correlations between candidate points and the query beyond pairwise relations. Comprehensive experimental results on two widely used datasets demonstrate that MSRM can outperform state-of-the-art methods in handling multiple matches while maintaining comparable performance on instance-level matching.
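To make the idea of a Gaussian query acting as a hyperedge over candidate nodes more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a diagonal covariance for the query, and it turns Mahalanobis distances into soft membership weights with an exponential kernel and a `temperature` parameter; the function names `mahalanobis_diag` and `hyperedge_weights`, the kernel, and the toy data are all hypothetical choices made for illustration.

```python
import math


def mahalanobis_diag(point, mean, var):
    """Mahalanobis distance from a point to a Gaussian with diagonal covariance.

    Assumption: the covariance is diagonal, so `var` holds one variance per dimension.
    """
    return math.sqrt(sum((p - m) ** 2 / v for p, m, v in zip(point, mean, var)))


def hyperedge_weights(candidates, query_mean, query_var, temperature=1.0):
    """Soft membership of each candidate node in the hyperedge of one Gaussian query.

    Assumption: memberships are exp(-distance / temperature), normalized to sum to 1.
    This normalization is an illustrative choice, not taken from the paper.
    """
    dists = [mahalanobis_diag(c, query_mean, query_var) for c in candidates]
    scores = [math.exp(-d / temperature) for d in dists]
    total = sum(scores)
    return [s / total for s in scores]


# Toy mini-batch: each candidate is represented by its mean embedding (a node).
candidates = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]

# Toy Gaussian query: wide along the first axis, narrow along the second.
query_mean = [0.0, 0.0]
query_var = [4.0, 0.25]

weights = hyperedge_weights(candidates, query_mean, query_var)
for candidate, weight in zip(candidates, weights):
    print(candidate, round(weight, 4))
```

In this toy setup the first two candidates are equally far from the query mean in Euclidean terms, but the first receives a larger weight because the query's variance is larger along that axis. This is the sense in which a single probabilistic query can relate to several candidates at once with different strengths.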
