A Corpus for Reasoning about Natural Language Grounded in Photographs

We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true of a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including reasoning about quantities, comparisons, and relations. Evaluation with state-of-the-art visual reasoning methods shows the data presents a strong challenge.
