A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering

Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend the image, draw on relevant knowledge from the entire web, and digest all of this information to answer the question. Most previous works address the problem by first fusing the image and question in the multi-modal space, which is inflexible for further fusion with a vast amount of external knowledge. In this paper, we call for a paradigm shift for the OK-VQA task: transform the image into plain text, so that both knowledge passage retrieval and generative question answering can be carried out in the natural language space. This paradigm takes advantage of the sheer volume of gigantic knowledge bases and the richness of pretrained language models. A Transform-Retrieve-Generate (TRiG) framework is proposed, which can be plugged and played with alternative image-to-text models and textual knowledge bases. Experimental results show that our TRiG framework outperforms all state-of-the-art supervised methods by an absolute margin of at least 11.1%.
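To make the three stages concrete, the following is a minimal sketch of how a Transform-Retrieve-Generate pipeline could be wired together. It is an illustration only, not the authors' implementation: the image-to-text captioner and the text-to-text generator are hypothetical callables, and the retriever is a toy lexical-overlap ranker standing in for a stronger method such as BM25 or dense passage retrieval.

```python
# Minimal sketch of a Transform-Retrieve-Generate pipeline (illustrative only).
# The captioning model, the toy retriever, and the generator are placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    text: str
    score: float = 0.0


def retrieve(query: str, knowledge_base: List[str], top_k: int = 3) -> List[Passage]:
    """Toy lexical retriever: rank passages by word overlap with the query.
    A real system would use BM25 or a dense retriever instead."""
    query_terms = set(query.lower().split())
    scored = [
        Passage(text=p, score=len(query_terms & set(p.lower().split())))
        for p in knowledge_base
    ]
    return sorted(scored, key=lambda p: p.score, reverse=True)[:top_k]


def answer_question(
    image_to_text: Callable[[bytes], str],   # hypothetical captioning model
    generate: Callable[[str], str],          # hypothetical text-to-text QA model
    image: bytes,
    question: str,
    knowledge_base: List[str],
) -> str:
    # 1. Transform: verbalize the image as plain text (captions, tags, etc.).
    image_text = image_to_text(image)
    # 2. Retrieve: use the question plus the image text to fetch knowledge passages.
    passages = retrieve(f"{question} {image_text}", knowledge_base)
    # 3. Generate: condition a generative QA model on question, image text, and passages.
    context = " ".join(p.text for p in passages)
    prompt = f"question: {question} image: {image_text} knowledge: {context}"
    return generate(prompt)
```

Because the image-to-text model, the knowledge base, and the generator enter only as arguments, each component can be swapped independently, which mirrors the plug-and-play property claimed for TRiG.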
