Dual Semantic Relationship Attention Network for Image-Text Matching

Image-Text Matching is one major task in cross-modal information processing. The main challenge is to learn the unified vision and language representations. Previous methods that perform well on this task primarily focus on the region features in images corresponding to the words in sentences. However, this will cause the regional features to lose contact with the global context, leading to the mismatch with those non-object words in some sentences. In this work, in order to alleviate this problem, a novel Dual Semantic Relationship Attention Network is proposed which mainly consists of two modules, separate semantic relationship module and the joint semantic relationship module. With these two modules, different hierarchies of semantic relationships are learned simultaneously, thus promoting the image-text matching process. Quantitative experiments have been performed on MS-COCO and Flickr-30K and our method outperforms previous approaches by a large margin due to the effectiveness of the dual semantic relationship attention scheme.

[1]  Han Zhang,et al.  Self-Attention Generative Adversarial Networks , 2018, ICML.

[2]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yang Yang,et al.  Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking , 2019, ACM Multimedia.

[4]  Michael I. Jordan,et al.  Advances in Neural Information Processing Systems 30 , 1995 .

[5]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[6]  Xueming Qian,et al.  Position Focused Attention Network for Image-Text Matching , 2019, IJCAI.

[7]  Takashi Matsubara,et al.  Target-Oriented Deformation of Visual-Semantic Embedding Space , 2019, IEICE Trans. Inf. Syst..

[8]  Jianfeng Gao,et al.  Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.

[9]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[11]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[12]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[13]  Huifang Ma,et al.  Combining Global and Local Similarity for Cross-Media Retrieval , 2020, IEEE Access.

[14]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[15]  David J. Fleet,et al.  VSE++: Improved Visual-Semantic Embeddings , 2017, ArXiv.

[16]  Alec Radford,et al.  Improving Language Understanding by Generative Pre-Training , 2018 .

[17]  Yang Yang,et al.  Adversarial Cross-Modal Retrieval , 2017, ACM Multimedia.

[18]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[19]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[20]  Xiu-Shen Wei,et al.  Multi-Label Image Recognition With Graph Convolutional Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[23]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[24]  Yan Huang,et al.  Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[26]  Li Fei-Fei,et al.  Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval , 2015, VL@EMNLP.

[27]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[28]  Xilin Chen,et al.  Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval , 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[29]  Yiming Yang,et al.  XLNet: Generalized Autoregressive Pretraining for Language Understanding , 2019, NeurIPS.

[30]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[31]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[33]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[34]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[35]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[36]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[37]  Yun Fu,et al.  Visual Semantic Reasoning for Image-Text Matching , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).