LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

Multimodal pre-training has driven great advances in vision-and-language research. These large-scale pre-trained models, although successful, suffer from slow inference speed due to the enormous computational cost of cross-modal attention in the Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario among V+L applications, which had been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates ITR inference by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-computing and caching feature indexes offline, and employing instant dot-product matching online, which significantly speeds up the retrieval process. In fact, LightningDOT achieves superior performance across mainstream ITR benchmarks such as the Flickr30k and COCO datasets, outperforming existing pre-trained models that consume 1000 times more computation, while using the same features.
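To make the offline/online split concrete, below is a minimal sketch of the pre-cached-index, dot-product retrieval idea using the FAISS library. The embeddings here are random placeholders standing in for pre-computed encoder outputs; the dimensionality and index choice are illustrative assumptions, not LightningDOT's actual encoders or configuration.

```python
# Sketch: offline indexing of image embeddings + online dot-product retrieval.
# Assumes embeddings are already pre-computed; random arrays stand in for them.
import numpy as np
import faiss

def build_image_index(image_embeddings: np.ndarray) -> faiss.Index:
    """Offline step: cache all image embeddings in an inner-product index."""
    dim = image_embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)                 # exact max-inner-product search
    index.add(image_embeddings.astype(np.float32))
    return index

def retrieve(index: faiss.Index, query_embedding: np.ndarray, k: int = 10):
    """Online step: one dot-product search replaces cross-modal attention."""
    query = query_embedding.astype(np.float32).reshape(1, -1)
    scores, ids = index.search(query, k)           # top-k images by dot product
    return ids[0], scores[0]

# Usage with placeholder data (768-dim embeddings for 10k images):
images = np.random.randn(10_000, 768).astype(np.float32)  # cached offline
index = build_image_index(images)
query = np.random.randn(768).astype(np.float32)           # one text embedding
top_ids, top_scores = retrieve(index, query, k=5)
```

Because the expensive encoding happens once offline, the online cost per query is a single embedding pass plus a nearest-neighbor lookup, which is what yields the reported thousand-fold speedups over cross-attention re-scoring of every image-text pair.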
