DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially created, 55k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs, as they remove the paraphrase-and-entity-typing shortcuts available in prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literature to this dataset and show that the best systems achieve only 38.4% F1 on our generalized accuracy metric, while expert human performance is 96%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
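To make the kind of discrete reasoning DROP targets concrete, consider the following minimal Python sketch. The passage, the questions, and the `field_goal_yardages` helper are entirely hypothetical illustrations (not items from the dataset or any baseline system); the point is that the answers must be computed from values mentioned in the text rather than copied as a contiguous span.

```python
import re

# Hypothetical DROP-style passage (invented for illustration, not a
# real dataset item). The answers below are not verbatim spans of the
# text; they must be derived by discrete operations over its numbers.
passage = (
    "The Bears scored on a 21-yard field goal in the first quarter and a "
    "35-yard field goal in the third quarter. The Packers answered with a "
    "7-yard touchdown run."
)

def field_goal_yardages(text):
    """Collect the yardage of every field goal mentioned in the text."""
    return [int(m.group(1)) for m in re.finditer(r"(\d+)-yard field goal", text)]

yards = field_goal_yardages(passage)

# "How many field goals were kicked?" -> counting
print(len(yards))               # 2

# "How many yards longer was the longest field goal than the shortest?"
# -> sorting (min/max) plus subtraction
print(max(yards) - min(yards))  # 14

# "What was the combined yardage of all field goals?" -> addition
print(sum(yards))               # 56
```

None of these answers ("2", "14", "56") occurs as a copyable span in the passage, which is what removes the paraphrase-and-entity-typing shortcuts that suffice on prior span-extraction datasets.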
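The abstract also refers to a generalized accuracy metric reported as F1. As a rough, simplified sketch only (this is not the official DROP evaluation script, which additionally normalizes numbers and dates and aligns sets of spans), a bag-of-words F1 between a predicted and a gold answer string can be computed as follows:

```python
from collections import Counter

def bag_of_words_f1(prediction: str, gold: str) -> float:
    """Simplified token-overlap F1 between two answer strings.

    A rough approximation of span-style QA scoring; the official DROP
    metric also handles number/date normalization and multi-span
    answers, which are omitted here.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(bag_of_words_f1("14 yards", "14"))  # ~0.667: partial credit on tokens
```

A soft overlap score like this gives partial credit for near-miss answers, which is why F1 rather than exact match is the headline number for both systems and human annotators.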
