Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

We propose Encyclopedic-VQA, a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs, each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence that supports each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models: PaLI [5] is state-of-the-art on OK-VQA [36], yet it achieves only 13.0% accuracy on our dataset. We further show experimentally that progress on answering our encyclopedic questions can be made by augmenting large models with a mechanism that retrieves relevant information from the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe our dataset enables future research on retrieval-augmented vision+language models.
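To make the retrieval-augmented setup concrete, below is a minimal sketch (not the authors' implementation) of the kind of pipeline the abstract describes: embed the query image, retrieve the most similar sections of the Wikipedia-derived knowledge base, and condition a large language model's answer on that evidence. All names in the sketch (KBEntry, retrieve_evidence, generate) are hypothetical placeholders, and it assumes images and knowledge-base sections have been embedded into a shared space by a CLIP-style dual encoder [22].

```python
# A minimal sketch of retrieval-augmented VQA over a fixed knowledge base.
# Hypothetical names throughout; not the paper's actual implementation.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class KBEntry:
    title: str    # Wikipedia article title (the entity)
    section: str  # text of one article section (candidate evidence)


def retrieve_evidence(
    image_embedding: np.ndarray,  # query embedding from a dual encoder (CLIP-style)
    kb_embeddings: np.ndarray,    # (num_sections, dim), precomputed and L2-normalized
    kb_entries: List[KBEntry],
    top_k: int = 5,
) -> List[KBEntry]:
    """Return the top_k knowledge-base sections most similar to the query image."""
    sims = kb_embeddings @ image_embedding   # cosine similarities (vectors normalized)
    best = np.argsort(-sims)[:top_k]
    return [kb_entries[i] for i in best]


def answer(
    question: str,
    image_embedding: np.ndarray,
    kb_embeddings: np.ndarray,
    kb_entries: List[KBEntry],
    generate: Callable[[str], str],  # any large LM exposed as prompt -> completion
) -> str:
    """Compose retrieved evidence into a prompt and let the language model answer."""
    evidence = retrieve_evidence(image_embedding, kb_embeddings, kb_entries)
    context = "\n".join(f"[{e.title}] {e.section}" for e in evidence)
    prompt = (
        "Answer the question using only the evidence below.\n"
        f"{context}\n"
        f"Question: {question}\nAnswer:"
    )
    return generate(prompt)
```

In this framing, the oracle experiment corresponds to replacing retrieve_evidence with the ground-truth evidence for each question; the gap between 48.8% (automatic retrieval) and 87.0% (perfect retrieval) suggests that retrieval quality, rather than the answer generator, is the main bottleneck.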

[1] Alan Ritter, et al. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? ArXiv, 2023.

[2] David A. Ross, et al. Reveal: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory. CVPR, 2023.

[3] Noah A. Smith, et al. PromptCap: Prompt-Guided Task-Aware Image Captioning. ArXiv, 2022.

[4] Andrew M. Dai, et al. Scaling Instruction-Finetuned Language Models. ArXiv, 2022.

[5] Ashish V. Thapliyal, et al. PaLI: A Jointly-Scaled Multilingual Language-Image Model. ICLR, 2022.

[6] Li Dong, et al. Language Models are General-Purpose Interfaces. ArXiv, 2022.

[7] Dustin Schwenk, et al. A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. ECCV, 2022.

[8] Lu Yuan, et al. REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering. NeurIPS, 2022.

[9] Radu Soricut, et al. All You May Need for VQA are Image Captions. NAACL, 2022.

[10] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning. NeurIPS, 2022.

[11] Andrew M. Dai, et al. PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 2022.

[12] C. Buck, et al. Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. EMNLP, 2022.

[13] Dale Schuurmans, et al. Chain of Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 2022.

[14] Dmytro Okhonko, et al. CM3: A Causal Masked Multimodal Model of the Internet. ArXiv, 2022.

[15] Yonatan Bisk, et al. KAT: A Knowledge Augmented Transformer for Vision-and-Language. NAACL, 2021.

[16] Diego de Las Casas, et al. Improving language models by retrieving from trillions of tokens. ICML, 2021.

[17] Zhe Gan, et al. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA. AAAI, 2021.

[18] Po-Sen Huang, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. ArXiv, 2021.

[19] Serge J. Belongie, et al. Benchmarking Representation Learning for Natural World Image Collections. CVPR, 2021.

[20] Soumen Chakrabarti, et al. Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering. SIGIR, 2021.

[21] Jiecao Chen, et al. WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning. SIGIR, 2021.

[22] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.

[23] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. ICML, 2021.

[24] Marcus Rohrbach, et al. KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA. CVPR, 2021.

[25] Loïc Barrault, et al. In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering. ACL, 2021.

[26] Mark Chen, et al. Language Models are Few-Shot Learners. NeurIPS, 2020.

[27] Fabio Petroni, et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS, 2020.

[28] Tobias Weyand, et al. Google Landmarks Dataset v2 – A Large-Scale Benchmark for Instance-Level Recognition and Retrieval. CVPR, 2020.

[29] Ming-Wei Chang, et al. REALM: Retrieval-Augmented Language Model Pre-Training. ICML, 2020.

[30] Omer Levy, et al. Generalization through Memorization: Nearest Neighbor Language Models. ICLR, 2019.

[31] Ashish Sabharwal, et al. QASC: A Dataset for Question Answering via Sentence Composition. AAAI, 2019.

[32] Colin Raffel, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 2019.

[33] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning. ECCV, 2019.

[34] Iryna Gurevych, et al. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP, 2019.

[35] Partha Pratim Talukdar, et al. KVQA: Knowledge-Aware Visual Question Answering. AAAI, 2019.

[36] Ali Farhadi, et al. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR, 2019.

[37] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. EMNLP, 2018.

[38] Peter Clark, et al. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. EMNLP, 2018.

[39] Jonathan Berant, et al. The Web as a Knowledge-Base for Answering Complex Questions. NAACL, 2018.

[40] Sebastian Riedel, et al. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. TACL, 2017.

[41] Qi Wu, et al. FVQA: Fact-Based Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[42] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. ACL, 2017.

[43] Kyunghyun Cho, et al. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. ArXiv, 2017.

[44] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 2016.

[45] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text. EMNLP, 2016.

[46] Licheng Yu, et al. Visual Madlibs: Fill in the blank Image Generation and Question Answering. ArXiv, 2015.

[47] Wei Xu, et al. Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question. NIPS, 2015.

[48] Richard S. Zemel, et al. Exploring Models and Data for Image Question Answering. NIPS, 2015.

[49] Margaret Mitchell, et al. VQA: Visual Question Answering. International Journal of Computer Vision, 2015.

[50] Mario Fritz, et al. A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input. NIPS, 2014.

[51] Eleanor Rosch. Principles of Categorization. 1978.