Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching

Learning semantic correspondence between image and text is significant as it bridges the semantic gap between vision and language. The key challenge is to accurately find and correlate shared semantics in image and text. Most existing methods achieve this goal by representing the shared semantic as a weighted combination of all the fragments (image regions or text words), where fragments relevant to the shared semantic obtain more attention, otherwise less. However, despite relevant ones contribute more to the shared semantic, irrelevant ones will more or less disturb it, and thus will lead to semantic misalignment in the correlation phase. To address this issue, we present a novel Bidirectional Focal Attention Network (BFAN), which not only allows to attend to relevant fragments but also diverts all the attention into these relevant fragments to concentrate on them. The main difference with existing works is they mostly focus on learning attention weight while our BFAN focus on eliminating irrelevant fragments from the shared semantic. The focal attention is achieved by preassigning attention based on inter-modality relation, identifying relevant fragments based on intra-modality relation and reassigning attention. Furthermore, the focal attention is jointly applied in both image-to-text and text-to-image directions, which enables to avoid preference to long text or complex image. Experiments show our simple but effective framework significantly outperforms state-of-the-art, with relative Recall@1 gains of 2.2% on both Flicr30K and MSCOCO benchmarks.

[1]  Qingming Huang,et al.  Learning Semantic Structure-preserved Embeddings for Cross-modal Retrieval , 2018, ACM Multimedia.

[2]  Yan Huang,et al.  Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[3]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, International Journal of Computer Vision.

[4]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[5]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[6]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[7]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[8]  Christopher D. Manning,et al.  Effective Approaches to Attention-based Neural Machine Translation , 2015, EMNLP.

[9]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[10]  Zhoujun Li,et al.  Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching , 2019, IEEE Transactions on Image Processing.

[11]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[13]  Huchuan Lu,et al.  Deep Cross-Modal Projection Learning for Image-Text Matching , 2018, ECCV.

[14]  Gang Hua,et al.  Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[15]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[16]  Yi Yang,et al.  Dual-Path Convolutional Image-Text Embedding , 2017, ArXiv.

[17]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[18]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yu Liu,et al.  Learning a Recurrent Residual Fusion Network for Multimodal Matching , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[20]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Xiaogang Wang,et al.  Identity-Aware Textual-Visual Matching with Latent Co-attention , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[23]  Alexander J. Smola,et al.  Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[25]  Zhe Gan,et al.  AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[26]  Louis-Philippe Morency,et al.  Multimodal Machine Learning: A Survey and Taxonomy , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[28]  Qingming Huang,et al.  Online Asymmetric Similarity Learning for Cross-Modal Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[32]  Haoqi Fan,et al.  Stacked Latent Attention for Multimodal Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[33]  Zhedong Zheng,et al.  Dual-path Convolutional Image-Text Embeddings with Instance Loss , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[34]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Ali Farhadi,et al.  Bidirectional Attention Flow for Machine Comprehension , 2016, ICLR.

[36]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[37]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[39]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[40]  Rodrigo C. Barros,et al.  Bidirectional Retrieval Made Simple , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.