Multi-Modal Answer Validation for Knowledge-Based VQA

The problem of knowledge-based visual question answering involves answering questions that require external knowledge in addition to the content of the image. Such knowledge typically comes in various forms, including visual, textual, and commonsense knowledge. Drawing on more knowledge sources, however, also increases the chance of retrieving irrelevant or noisy facts, making it challenging to comprehend the facts and find the answer. To address this challenge, we propose Multi-modal Answer Validation using External knowledge (MAVEx), whose core idea is to validate a set of promising answer candidates based on answer-specific knowledge retrieval. Instead of searching for the answer in a vast collection of often irrelevant facts, as most existing approaches do, MAVEx learns to extract relevant knowledge from noisy sources, to decide which knowledge source to trust for each answer candidate, and to validate the candidate using that source. Our multi-modal setting is the first to leverage external visual knowledge (images retrieved via Google search) in addition to textual knowledge in the form of Wikipedia sentences and ConceptNet concepts. Our experiments with OK-VQA, a challenging knowledge-based VQA dataset, demonstrate that MAVEx achieves new state-of-the-art results. Our code is available at https://github.com/jialinwu17/MAVEX
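
To make the candidate-validation pipeline concrete, the sketch below illustrates the flow the abstract describes: generate answer candidates, retrieve answer-specific evidence from each knowledge source (Wikipedia sentences, ConceptNet concepts, retrieved images), and score each candidate against its own evidence with per-source trust weights. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the placeholder support scores, and the fixed trust weights are all hypothetical (in MAVEx these quantities are produced by learned multi-modal models).

```python
# Minimal, illustrative sketch of answer-specific validation (NOT the MAVEx code).
# All names, trust weights, and placeholder scores below are assumptions.

from dataclasses import dataclass, field


@dataclass
class Candidate:
    answer: str
    evidence: dict = field(default_factory=dict)  # per-source, answer-specific knowledge


def retrieve_evidence(question: str, answer: str) -> dict:
    """Stand-in for answer-specific retrieval from Wikipedia sentences,
    ConceptNet concepts, and Google-retrieved images (placeholder strings here)."""
    return {
        "wikipedia": f"sentences mentioning '{answer}' relevant to: {question}",
        "conceptnet": f"concepts linked to '{answer}'",
        "images": f"visual features of images retrieved for '{answer}'",
    }


def validate(candidate: Candidate, source_trust: dict) -> float:
    """Toy validation score: weight per-source support by a trust score.
    A real system would score support with a multi-modal encoder; here it is a dummy 1.0."""
    support = {src: 1.0 for src in candidate.evidence}  # placeholder agreement scores
    return sum(source_trust.get(src, 0.0) * s for src, s in support.items())


def answer_question(question: str, answer_candidates: list) -> str:
    """Score each promising candidate against its own retrieved evidence; return the best."""
    # Fixed weights for illustration; learned per candidate in the actual approach.
    source_trust = {"wikipedia": 0.5, "conceptnet": 0.3, "images": 0.2}
    scored = []
    for ans in answer_candidates:
        cand = Candidate(ans, retrieve_evidence(question, ans))
        scored.append((validate(cand, source_trust), ans))
    return max(scored)[1]


if __name__ == "__main__":
    print(answer_question("What sport can you use this for?", ["surfing", "skiing", "tennis"]))
```

The key design point the sketch tries to convey is that retrieval is conditioned on each answer candidate rather than on the question alone, so each candidate is judged only against evidence gathered specifically to support or refute it.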
