Natural Questions: A Benchmark for Question Answering Research

We present the Natural Questions corpus, a question answering data set. Questions consist of real, anonymized, aggregated queries issued to the Google search engine. An annotator is presented with a question along with a Wikipedia page from the top 5 search results, and annotates a long answer (typically a paragraph) and a short answer (one or more entities) if present on the page, or marks null if no long/short answer is present. The public release consists of 307,373 training examples with single annotations; 7,830 examples with 5-way annotations for development data; and a further 7,842 examples with 5-way annotations, sequestered as test data. We present experiments validating the quality of the data. We also describe an analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.
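To make the annotation format and the multi-way evaluation setup concrete, below is a minimal sketch of how 5-way annotations might be aggregated into a gold answerability decision for one example. The field names, the byte-span representation, and the 2-of-5 threshold are illustrative assumptions, not the official Natural Questions evaluation code.

```python
# Illustrative sketch (not the official NQ evaluation script): one plausible
# way to reduce 5-way annotations to a per-example gold decision.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Annotation:
    # (start_byte, end_byte) span into the Wikipedia page, or None if the
    # annotator marked null. Field names are hypothetical.
    long_answer: Optional[Tuple[int, int]]
    short_answers: List[Tuple[int, int]]  # possibly empty list of entity spans

def has_gold_long_answer(annotations: List[Annotation], threshold: int = 2) -> bool:
    """Treat the example as answerable if at least `threshold` of the
    annotators marked a non-null long answer (threshold is an assumption)."""
    non_null = sum(1 for a in annotations if a.long_answer is not None)
    return non_null >= threshold

# Usage: 3 of 5 annotators found a long answer, so the example counts as
# answerable under the assumed 2-of-5 rule.
example = [
    Annotation((120, 940), [(200, 215)]),
    Annotation((120, 940), []),
    Annotation(None, []),
    Annotation((120, 940), [(200, 215)]),
    Annotation(None, []),
]
print(has_gold_long_answer(example))  # True
```

Precision/recall-style metrics over question answering systems can then be built on top of such per-example decisions, with a null prediction counting as correct when the example is judged unanswerable.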
