ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists of finding images related to a given query text, or vice versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods propose effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we fill the gap between effectiveness and efficiency with an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space, in which an efficient kNN search can be performed, by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
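
To make the align-and-distill idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: a fine-grained alignment head produces teacher relevance scores, and a dual-encoder student, which only computes dot products between global embeddings, is trained to reproduce them. The pooling scheme, the listwise KL distillation loss, and all names (`fine_grained_scores`, `distillation_loss`, `tau`) are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F


def fine_grained_scores(region_feats, word_feats):
    """Teacher scores from token-level alignment.

    region_feats: (B, R, D) image region features
    word_feats:   (B, W, D) caption word features
    Returns a (B, B) matrix of image-text relevance scores, obtained here
    with max-over-regions / mean-over-words pooling of cosine similarities
    (an assumed pooling; the paper's alignment head may differ).
    """
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)
    sim = torch.einsum('ird,jwd->ijrw', r, w)   # region-word similarities
    return sim.max(dim=2).values.mean(dim=2)    # (B, B)


def distillation_loss(student_scores, teacher_scores, tau=0.1):
    """Listwise distillation: align the student's in-batch score
    distribution with the teacher's via KL divergence (assumed loss)."""
    t = F.softmax(teacher_scores / tau, dim=1)
    s = F.log_softmax(student_scores / tau, dim=1)
    return F.kl_div(s, t, reduction='batchmean')


# Student: a dual encoder producing one global embedding per image/text,
# so relevance reduces to a dot product that a kNN index can serve.
batch, dim = 32, 512
img_emb = F.normalize(torch.randn(batch, dim), dim=-1)   # stand-in encoder outputs
txt_emb = F.normalize(torch.randn(batch, dim), dim=-1)
student_scores = img_emb @ txt_emb.t()                   # (B, B)

teacher_scores = fine_grained_scores(torch.randn(batch, 36, dim),
                                     torch.randn(batch, 20, dim))
loss = distillation_loss(student_scores, teacher_scores.detach())
```

At retrieval time, the database-side global embeddings can be precomputed and indexed, so answering a query takes a single encoder pass plus a kNN lookup instead of a full cross-attention pass per candidate, which is consistent with the near-90x speed-up reported above.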
