GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering

We introduce GQA, a new dataset for real-world visual reasoning and compositional question answering that seeks to address key shortcomings of previous VQA datasets. We have developed a strong and robust question engine that leverages Visual Genome scene graph structures to create 22M diverse reasoning questions, all of which come with functional programs representing their semantics. We use these programs to gain tight control over the answer distribution and present a new tunable smoothing technique to mitigate question biases. Accompanying the dataset is a suite of new metrics that evaluate essential qualities such as consistency, grounding, and plausibility. We perform a careful analysis of baselines as well as state-of-the-art models, providing fine-grained results for different question types and topologies. Whereas a blind LSTM obtains a mere 42.1% accuracy and strong VQA models achieve 54.1%, human performance tops out at 89.3%, offering ample opportunity for new research to explore. We hope GQA will provide an enabling resource for the next generation of models with enhanced robustness, improved consistency, and a deeper semantic understanding of vision and language.
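
The tunable smoothing technique is specified in the paper body; as a rough sketch of the general idea (not the authors' exact procedure), one can temper a long-tailed answer distribution by raising each answer's frequency to a power alpha in (0, 1], i.e. p(a) proportional to count(a)^alpha, and then downsampling questions to match the tempered target. The function names and the `alpha` parameter below are hypothetical:

```python
import random
from collections import Counter

def smoothed_target(counts, alpha=0.5):
    """Temper a skewed answer distribution: p(a) proportional to count(a)**alpha.
    alpha = 1 keeps the original distribution; alpha -> 0 approaches uniform.
    (Illustrative only -- not the exact GQA procedure.)"""
    weights = {a: c ** alpha for a, c in counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def downsample(qa_pairs, alpha=0.5, seed=0):
    """Subsample (question, answer) pairs so answer frequencies follow the
    smoothed target, without requesting more pairs than exist per answer."""
    rng = random.Random(seed)
    counts = Counter(answer for _, answer in qa_pairs)
    target = smoothed_target(counts, alpha)
    # Largest total size n such that every answer's quota n * target[a]
    # stays within its available question pool.
    n = min(counts[a] / target[a] for a in counts)
    kept = []
    for a in counts:
        pool = [pair for pair in qa_pairs if pair[1] == a]
        quota = min(int(n * target[a]), len(pool))  # guard against float rounding
        kept.extend(rng.sample(pool, quota))
    return kept
```

With a small alpha (say, `downsample(pairs, alpha=0.3)`), head answers such as "yes" or "white" no longer dominate the sample, while the relative ordering of answer frequencies is preserved.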
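
Of the new metrics, consistency is the most algorithmic: a model that answers a question correctly should also answer the questions it entails correctly (e.g., "Is there a red apple?" entails "Is there an apple?"). A minimal sketch of such a metric, assuming each question id maps to the ids of the questions it entails (this data layout is an assumption, not the official GQA evaluation code):

```python
def consistency(predictions, gold, entailed):
    """Average, over correctly answered questions that entail others, the
    fraction of their entailed questions also answered correctly.

    predictions, gold: dicts mapping question id -> answer string
    entailed: dict mapping question id -> list of entailed question ids
    (assumes every entailed id appears in both predictions and gold)
    """
    per_question = []
    for qid, answer in predictions.items():
        if answer != gold[qid] or not entailed.get(qid):
            continue  # score only correct answers that entail other questions
        hits = [predictions[e] == gold[e] for e in entailed[qid]]
        per_question.append(sum(hits) / len(hits))
    return sum(per_question) / len(per_question) if per_question else 0.0
```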
