Incorporating External Knowledge to Answer Open-Domain Visual Questions with Dynamic Memory Networks

Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationship between the multi-modal analysis of images and natural language. Most current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework that endows the model with the capability to answer more complex questions by leveraging massive external knowledge through dynamic memory networks. Specifically, a question and its corresponding image trigger a process that retrieves relevant information from external knowledge bases; the retrieved facts are embedded into a continuous vector space in a way that preserves their entity-relation structure. We then employ dynamic memory networks to attend to this large body of facts from the knowledge graph and the image, and reason over them to generate the answer. Extensive experiments demonstrate that our model not only achieves state-of-the-art performance on the standard visual question answering task, but also answers open-domain questions effectively by leveraging external knowledge.
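To make the two mechanisms in the abstract concrete, the sketch below illustrates (a) a TransE-style translational embedding, in which a fact (h, r, t) is scored by how well h + r approximates t, and (b) one episodic pass of a dynamic-memory-style attention update over retrieved fact embeddings. This is a minimal illustration, not the paper's implementation: the function names, the 64-dimensional random embeddings, the dot-product attention, and the fixed 0.5 gating constant are all assumptions (the actual model learns its embeddings and uses a GRU-based memory update).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # embedding dimensionality (illustrative)

def transe_score(head, rel, tail):
    # TransE models a fact (h, r, t) as h + r ≈ t; a smaller
    # L2 distance means the triple is more plausible.
    return np.linalg.norm(head + rel - tail)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episodic_pass(facts, question, memory):
    # One DMN-style episode: score each fact against the question and the
    # current memory, then fold the attention-weighted summary back in.
    scores = np.array([f @ question + f @ memory for f in facts])
    weights = softmax(scores)
    episode = weights @ facts              # attention-weighted fact summary
    memory = 0.5 * memory + 0.5 * episode  # stand-in for the learned GRU update
    return memory, weights

# Toy usage: five retrieved facts (KB triples and image regions, already
# embedded), with two reasoning passes so attention can shift between facts.
facts = rng.normal(size=(5, dim))
question = rng.normal(size=dim)
memory = question.copy()                   # memory is initialized from the question
for _ in range(2):
    memory, weights = episodic_pass(facts, question, memory)
print("attention over facts:", np.round(weights, 3))

h, r, t = rng.normal(size=(3, dim))        # a random (head, relation, tail) triple
print("TransE distance h + r vs. t:", round(transe_score(h, r, t), 3))
```

Running multiple episodic passes lets the memory condition later attention on earlier retrievals, which is what supports the multi-step reasoning over external facts described above.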
