论文信息 - A Corpus for Reasoning about Natural Language Grounded in Photographs

A Corpus for Reasoning about Natural Language Grounded in Photographs

We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

[1] Francis Ferraro,et al. A Survey of Current Datasets for Vision and Language Research , 2015, EMNLP.

[2] Trevor Darrell,et al. Explainable Neural Computation via Stack Neural Module Networks , 2018, ECCV.

[3] Asim Kadav,et al. Visual Entailment: A Novel Task for Fine-Grained Image Understanding , 2019, ArXiv.

[4] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] Wenhu Chen,et al. Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning , 2016, ArXiv.

[6] Dhruv Batra,et al. C-VQA: A Compositional Split of the Visual Question Answering (VQA) v1.0 Dataset , 2017, ArXiv.

[7] Louis-Philippe Morency,et al. Using Syntax to Ground Referring Expressions in Natural Images , 2018, AAAI.

[8] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[9] Licheng Yu,et al. TVQA: Localized, Compositional Video Question Answering , 2018, EMNLP.

[10] Ross A. Knepper,et al. Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction , 2018, CoRL.

[11] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] George A. Miller,et al. WordNet: A Lexical Database for English , 1995, HLT.

[13] Ali Farhadi,et al. From Recognition to Cognition: Visual Commonsense Reasoning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Héctor Allende,et al. Working Memory Networks: Augmenting Memory Networks with a Relational Reasoning Module , 2018, ACL.

[15] Aaron C. Courville,et al. FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[16] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17] Dhruv Batra,et al. Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[18] Xinlei Chen,et al. Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[19] Alexander Kuhnle,et al. ShapeWorld - A new test methodology for multimodal language understanding , 2017, ArXiv.

[20] Jason Weston,et al. Talk the Walk: Navigating New York City through Grounded Dialogue , 2018, ArXiv.

[21] Michael S. Bernstein,et al. ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[22] Vicente Ordonez,et al. ReferItGame: Referring to Objects in Photographs of Natural Scenes , 2014, EMNLP.

[23] Benjamin Kuipers,et al. Walk the Talk: Connecting Language, Knowledge, and Action in Route Instructions , 2006, AAAI.

[24] Sergey Ioffe,et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning , 2016, AAAI.

[25] Chen Huang,et al. Learning to Disambiguate by Asking Discriminative Questions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26] Hao Tan,et al. Object Ordering with Bidirectional Matchings for Visual Reasoning , 2018, NAACL-HLT.

[27] Justin Johnson,et al. DDRprog: A CLEVR Differentiable Dynamic Reasoning Programmer , 2018, ArXiv.

[28] Xiao-Jing Wang,et al. A dataset and architecture for visual reasoning with a working memory , 2018, ECCV.

[29] Luke S. Zettlemoyer,et al. Learning Distributions over Logical Forms for Referring Expression Generation , 2013, EMNLP.

[30] Kees van Deemter,et al. Natural Reference to Objects in a Visual Domain , 2010, INLG.

[31] C. Lawrence Zitnick,et al. Bringing Semantics into Focus Using Visual Abstraction , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[32] Hexiang Hu,et al. Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding , 2019, ArXiv.

[33] Christopher D. Manning,et al. Compositional Attention Networks for Machine Reasoning , 2018, ICLR.

[34] Ross B. Girshick,et al. Mask R-CNN , 2017, 1703.06870.

[35] Alan L. Yuille,et al. Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] J. R. Landis,et al. The measurement of observer agreement for categorical data. , 1977, Biometrics.

[37] Andrew Bennett,et al. Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction , 2018, EMNLP.

[38] Jonathan Berant,et al. Weakly Supervised Semantic Parsing with Abstract Examples , 2017, ACL.

[39] Yoav Artzi,et al. TOUCHDOWN: Natural Language Navigation and Spatial Reasoning in Visual Street Environments , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[41] Yoav Artzi,et al. A Corpus of Natural Language for Visual Reasoning , 2017, ACL.

[42] Pietro Perona,et al. Microsoft COCO: Common Objects in Context , 2014, ECCV.

[43] David Mascharka,et al. Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[44] Bo Xu,et al. Cascaded Mutual Modulation for Visual Reasoning , 2018, EMNLP.

[45] Christopher Kanan,et al. TallyQA: Answering Complex Counting Questions , 2018, AAAI.

[46] Dan Klein,et al. Learning to Compose Neural Networks for Question Answering , 2016, NAACL.

[47] Chris Callison-Burch,et al. Effectively Crowdsourcing Radiology Report Annotations , 2015, Louhi@EMNLP.

[48] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[50] Razvan Pascanu,et al. A simple neural network module for relational reasoning , 2017, NIPS.

[51] Jeffrey L. Elman,et al. Finding Structure in Time , 1990, Cogn. Sci..

[52] Kewei Tu,et al. Structured Attentions for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[53] Raymond J. Mooney,et al. Learning to Interpret Natural Language Navigation Instructions from Observations , 2011, Proceedings of the AAAI Conference on Artificial Intelligence.

[54] Daniel Marcu,et al. Towards a Dataset for Human Computer Communication via Grounded Language Acquisition , 2016, AAAI Workshop: Symbiotic Cognitive Systems.

[55] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[56] Yash Goyal,et al. Yin and Yang: Balancing and Answering Binary Visual Questions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Michael S. Bernstein,et al. Scalable multi-label annotation , 2014, CHI.

[58] Matthew Stone,et al. “Caption” as a Coherence Relation: Evidence and Implications , 2019, Proceedings of the Second Workshop on Shortcomings in Vision and Language.

[59] Qi Wu,et al. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[60] Dan Klein,et al. Neural Module Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61] Li Fei-Fei,et al. Inferring and Executing Programs for Visual Reasoning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62] Chuang Gan,et al. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding , 2018, NeurIPS.

[63] Yoshua Bengio,et al. FigureQA: An Annotated Figure Dataset for Visual Reasoning , 2017, ICLR.

[64] Christopher Kanan,et al. An Analysis of Visual Question Answering Algorithms , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[65] Trevor Darrell,et al. Learning to Reason: End-to-End Module Networks for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[66] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[67] Luke S. Zettlemoyer,et al. A Joint Model of Language and Perception for Grounded Attribute Learning , 2012, ICML.

[68] Carl Doersch,et al. Learning Visual Question Answering by Bootstrapping Hard Attention , 2018, ECCV.

[69] Li Fei-Fei,et al. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[70] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[71] Christopher D. Manning,et al. GQA: a new dataset for compositional question answering over real-world images , 2019, ArXiv.