Paper Information - Multilateral Semantic Relations Modeling for Image Text Retrieval

Multilateral Semantic Relations Modeling for Image Text Retrieval

Image-text retrieval is a fundamental task that bridges vision and language, typically by exploiting various strategies for fine-grained alignment between regions and words. The task remains difficult mainly because of one-to-many correspondence, where an arbitrary query can match a set of instances from the other modality. While existing solutions to this problem, including multi-point mapping, probabilistic distribution, and geometric embedding, have made promising progress, one-to-many correspondence is still under-explored. In this work, we develop Multilateral Semantic Relations Modeling (termed MSRM) for image-text retrieval, which captures the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped to a probabilistic embedding to learn its true semantic distribution based on Mahalanobis distance. Then each candidate instance in a mini-batch is regarded as a hypergraph node with its mean semantics, while a Gaussian query is modeled as a hyperedge to capture semantic correlations between the candidate points and the query that go beyond pairwise matching. Comprehensive experimental results on two widely used datasets demonstrate that our MSRM method can outperform state-of-the-art methods in handling multiple matches while maintaining comparable performance on instance-level matching.
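
The two ingredients named in the abstract, a Gaussian query scored by Mahalanobis distance and a hyperedge linking that query to several candidates at once, can be illustrated with a small sketch. This is not the authors' implementation. The diagonal covariance, the softmax over negative distances used as soft hyperedge membership, and the names `mahalanobis_diag`, `hyperedge_weights`, and `temperature` are all assumptions made for illustration only.

```python
import math


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance from point x to a Gaussian with diagonal covariance.

    Assumption: the covariance is diagonal, so `var` holds per-dimension variances.
    """
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def hyperedge_weights(candidates, query_mean, query_var, temperature=1.0):
    """Soft membership of each candidate node in the hyperedge of one Gaussian query.

    Assumption: membership is a softmax over negative Mahalanobis distances.
    The paper may define the hyperedge incidence differently.
    """
    dists = [mahalanobis_diag(c, query_mean, query_var) for c in candidates]
    scores = [-d / temperature for d in dists]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return dists, [e / total for e in exps]


# Toy mini-batch: three candidate points (their mean semantics) and one Gaussian query.
query_mean = [0.0, 0.0]
query_var = [1.0, 4.0]  # the query is more uncertain along the second dimension
candidates = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]

dists, weights = hyperedge_weights(candidates, query_mean, query_var)
for c, d, w in zip(candidates, dists, weights):
    print(c, round(d, 3), round(w, 3))
```

In this toy example the first two candidates lie at the same Mahalanobis distance from the query even though their Euclidean distances differ, so both receive equal membership in the hyperedge. That is one way a single query can relate to several matches at once.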
