Passage Retrieval for Outside-Knowledge Visual Question Answering

In this work, we address multi-modal information needs that combine a text question with an image, focusing on passage retrieval for outside-knowledge visual question answering. The task requires access to outside knowledge, which we define here as a large collection of unstructured passages. We first conduct sparse retrieval with BM25, studying query expansion with object names and image captions. We verify that visual clues play an important role, and that captions tend to be more informative than object names for sparse retrieval. We then construct a dual-encoder dense retriever whose query encoder is LXMERT, a multi-modal pre-trained transformer. We show that dense retrieval significantly outperforms sparse retrieval with object-name expansion, and that it matches the performance of sparse retrieval that leverages human-generated captions.
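To make the sparse-retrieval setup concrete, here is a minimal sketch of BM25 retrieval with visual query expansion, using the open-source rank_bm25 package. The passage collection, detected object names, and caption below are illustrative stand-ins, not data from the paper; in the actual pipeline the object names would come from a detector and the caption from a captioning model (or a human annotator).

```python
from rank_bm25 import BM25Okapi

# Toy passage collection standing in for the large unstructured corpus.
passages = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "Golden retrievers are a popular breed of family dog.",
    "Paris is the capital and most populous city of France.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

question = "What city is this landmark located in?"
object_names = ["tower", "sky"]                   # hypothetical detector output
caption = "the eiffel tower against a blue sky"   # hypothetical caption

# Expand the text question with visual clues before sparse retrieval.
expanded_query = f"{question} {' '.join(object_names)} {caption}"
scores = bm25.get_scores(expanded_query.lower().split())
best = max(range(len(passages)), key=lambda i: scores[i])
print(passages[best])
```

Because BM25 matches on surface terms, the caption contributes full descriptive phrases ("eiffel tower") while object names contribute only isolated nouns, which is consistent with captions being the more informative expansion source.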

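The dense retriever can be sketched in the same spirit. Below, LXMERT encodes the multi-modal query (question text plus region features from an object detector) and a separate text encoder embeds passages, with relevance scored by dot product as in DPR-style dual encoders. This is a sketch under stated assumptions: the random visual features stand in for real Faster R-CNN region features, the checkpoints are the standard Hugging Face releases rather than the paper's fine-tuned weights, and the choice of BERT as the passage encoder is an assumption, since the abstract specifies only the query encoder.

```python
import torch
from transformers import (BertModel, BertTokenizer,
                          LxmertModel, LxmertTokenizer)

q_tok = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
q_enc = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
p_tok = BertTokenizer.from_pretrained("bert-base-uncased")
p_enc = BertModel.from_pretrained("bert-base-uncased")

question = "What city is this landmark located in?"
inputs = q_tok(question, return_tensors="pt")
# 36 region features (2048-d) with normalized box coordinates, the shapes
# a Faster R-CNN detector would produce; random values for illustration.
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

with torch.no_grad():
    # Cross-modal query embedding from LXMERT's pooled output.
    q_vec = q_enc(**inputs, visual_feats=visual_feats,
                  visual_pos=visual_pos).pooled_output       # (1, 768)
    passages = ["Paris is the capital of France.",
                "Golden retrievers are family dogs."]
    p_inputs = p_tok(passages, padding=True, return_tensors="pt")
    p_vecs = p_enc(**p_inputs).last_hidden_state[:, 0]       # [CLS], (2, 768)

scores = q_vec @ p_vecs.T   # dot-product relevance over the collection
print(scores)
```

In a full system the passage vectors would be precomputed and indexed for approximate nearest-neighbor search, so only the multi-modal query needs to be encoded at retrieval time.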