VQA: Visual Question Answering

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words, or can be evaluated against a closed set of candidate answers in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
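Since each question in the dataset is answered by ten human annotators, open-ended accuracy can be scored automatically by consensus. Below is a minimal Python sketch of the paper's accuracy formula, min(#humans who gave that answer / 3, 1), so that an answer agreeing with at least three annotators receives full credit. It is illustrative only: the released evaluation scripts additionally normalize punctuation, articles, and number words before matching, which this sketch omits.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus accuracy of one predicted answer against ten human answers.

    An answer matching >= 3 annotators scores 1.0; fewer matches earn
    proportional credit (matches / 3). Matching here is simplified to
    case-insensitive string equality.
    """
    pred = predicted.strip().lower()
    matches = sum(a.strip().lower() == pred for a in human_answers)
    return min(matches / 3.0, 1.0)


if __name__ == "__main__":
    humans = ["yes"] * 7 + ["no"] * 2 + ["maybe"]
    print(vqa_accuracy("yes", humans))  # 1.0  -- agrees with >= 3 annotators
    print(vqa_accuracy("no", humans))   # 0.67 -- agrees with only 2 annotators
```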
