Combine Early and Late Fusion Together: A Hybrid Fusion Framework for Image-Text Matching

Image-text matching is a challenging task in cross-modal learning due to the discrepancy in data representation between the image and text modalities. Mainstream methods adopt late fusion, generating image-text similarity from the separately encoded features of each modality, and devote considerable training cost to capturing intra-modality associations. In this work, we propose to Combine Early and Late Fusion Together (CELFT), a universal hybrid fusion framework that effectively overcomes these shortcomings of the late fusion scheme. In the proposed CELFT framework, the hybrid structure with early fusion and late fusion facilitates interaction between the image and text modalities at an early stage. Moreover, the two fusion strategies complement each other in capturing inter-modal and intra-modal information, which ensures that more accurate image-text similarity is learned. In the experiments, we choose four recent approaches based on the late fusion scheme as base models and integrate them with our CELFT framework. The results on two widely used image-text datasets, MSCOCO and Flickr30K, show that the matching performance of all base models is significantly improved with remarkably reduced training time.
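To make the hybrid fusion idea concrete, below is a minimal sketch (in PyTorch) of how an early-fusion branch and a late-fusion branch might be combined into a single image-text similarity score. All module names, feature dimensions, and the simple summation of the two scores are illustrative assumptions for exposition only; this is not the actual CELFT architecture described in the paper.

    # Illustrative sketch: combining an early-fusion and a late-fusion branch
    # into one image-text similarity score. Names and shapes are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HybridFusionMatcher(nn.Module):
        def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512, num_heads=8):
            super().__init__()
            # Project region / word features into a shared embedding space.
            self.img_proj = nn.Linear(img_dim, embed_dim)
            self.txt_proj = nn.Linear(txt_dim, embed_dim)
            # Early fusion: cross-modal attention applied before any similarity
            # is computed, so the two modalities interact at an early stage.
            self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.score_head = nn.Linear(embed_dim, 1)

        def late_score(self, img_feats, txt_feats):
            # Late fusion: pool each modality independently, then compare.
            img_emb = F.normalize(self.img_proj(img_feats).mean(dim=1), dim=-1)
            txt_emb = F.normalize(self.txt_proj(txt_feats).mean(dim=1), dim=-1)
            return (img_emb * txt_emb).sum(dim=-1)          # cosine similarity

        def early_score(self, img_feats, txt_feats):
            # Early fusion: token-level interaction between words and regions.
            img_emb = self.img_proj(img_feats)              # (B, regions, D)
            txt_emb = self.txt_proj(txt_feats)              # (B, words, D)
            fused, _ = self.cross_attn(query=txt_emb, key=img_emb, value=img_emb)
            return self.score_head(fused.mean(dim=1)).squeeze(-1)

        def forward(self, img_feats, txt_feats):
            # The two complementary scores are combined; a plain sum is used here.
            return self.late_score(img_feats, txt_feats) + self.early_score(img_feats, txt_feats)

    if __name__ == "__main__":
        model = HybridFusionMatcher()
        regions = torch.randn(4, 36, 2048)   # e.g. 36 detected regions per image
        words = torch.randn(4, 20, 300)      # e.g. 20 word embeddings per caption
        print(model(regions, words).shape)   # torch.Size([4])

In this sketch the late-fusion branch mirrors embedding-based matching on independently pooled features, while the early-fusion branch lets word features attend to region features before scoring; summing the two scores is one simple way the branches could complement each other.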
