Reveal: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory

In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever, and the generator. The large-scale memory encodes diverse sources of multimodal world knowledge (e.g., image-text pairs, question-answering pairs, and knowledge-graph triplets) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty of our approach is that the memory, encoder, retriever, and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can draw on a diverse set of multimodal knowledge sources, which we show yields significant gains. REVEAL achieves state-of-the-art results on knowledge-based visual question answering and image captioning.
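To make the retrieve-then-generate flow described above concrete, the following is a minimal sketch, not the authors' released implementation: the memory is represented as a matrix of key embeddings with associated value encodings produced by the unified encoder, the retriever scores entries by inner product, and the retrieved values are fused with the query before decoding. All function and variable names here (`retrieve`, `answer`, `memory_keys`, `memory_values`, `generator`) are hypothetical illustrations.

```python
import numpy as np

def retrieve(query_emb, memory_keys, memory_values, top_k=5):
    """Return the top-k memory entries by inner-product similarity (MIPS).

    Hypothetical sketch: memory_keys is (num_entries, d), memory_values is
    (num_entries, d_v), query_emb is (d,).
    """
    scores = memory_keys @ query_emb            # similarity of query to every entry
    top_idx = np.argsort(-scores)[:top_k]       # indices of the most relevant entries
    return memory_values[top_idx], scores[top_idx]

def answer(query_emb, memory_keys, memory_values, generator, top_k=5):
    """Fuse retrieved knowledge with the query and decode an output."""
    retrieved, scores = retrieve(query_emb, memory_keys, memory_values, top_k)
    # Soft, score-weighted fusion: because retrieval scores enter the fused
    # representation, the retriever can receive gradient signal when the whole
    # pipeline is trained end-to-end (assumption for this sketch).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    knowledge = (weights[:, None] * retrieved).sum(axis=0)
    fused = np.concatenate([query_emb, knowledge])
    return generator(fused)
```

In an actual system of this kind, the exact-search `argsort` would be replaced by an approximate maximum inner product search over the large-scale memory, and the weighted sum by attention inside the generator; the sketch only illustrates the data flow among the four components.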
