ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning

Given a question about a prototypical situation -- such as "Name something that people usually do before they leave the house for work." -- humans can easily answer it from acquired experience. Such questions can have multiple correct answers, some more common for the situation than others. This paper introduces a new question answering dataset for training and evaluating the common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played on a long-running international game show -- Family Feud. The hidden evaluation set is created by collecting answers to each question from 100 crowd workers. We also propose an open-domain task in which a model must output a ranked list of answers, ideally covering all prototypical answers for a question. Evaluating several competitive state-of-the-art models on this dataset, we find a significant gap between the best model and human performance on a number of evaluation metrics.
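
As a rough illustration of how such a ranked-list task can be scored, the sketch below matches a model's ranked answers against answer clusters built from crowd-sourced responses and credits each cluster at most once via an optimal one-to-one assignment. The cluster data, the exact-string matching rule, the answer cap, and the function name are illustrative assumptions, not the paper's official evaluation protocol.

# Minimal sketch (assumed setup, not the official ProtoQA scorer):
# score a ranked list of model answers against crowd answer clusters.
import numpy as np
from scipy.optimize import linear_sum_assignment

def score_ranked_answers(predicted, clusters, max_answers=10):
    """predicted: ranked list of answer strings from a model.
    clusters: list of (answer_set, count) pairs built from crowd answers,
    where count is how many of the ~100 workers gave an answer in that set."""
    predicted = predicted[:max_answers]          # cap the ranked list length
    total = sum(count for _, count in clusters)  # best achievable credit

    # Reward matrix: a prediction earns a cluster's count if it matches
    # any surface form in that cluster (exact match here for simplicity;
    # softer matching could be substituted).
    reward = np.zeros((len(predicted), len(clusters)))
    for i, pred in enumerate(predicted):
        for j, (answers, count) in enumerate(clusters):
            if pred.strip().lower() in answers:
                reward[i, j] = count

    # Each prediction is credited to at most one cluster and vice versa;
    # an optimal assignment maximizes the total credit earned.
    rows, cols = linear_sum_assignment(reward, maximize=True)
    return reward[rows, cols].sum() / total

# Example: crowd answers grouped into clusters with their frequencies.
clusters = [({"eat breakfast", "have breakfast"}, 40),
            ({"shower", "take a shower"}, 25),
            ({"get dressed"}, 20)]
print(score_ranked_answers(["shower", "eat breakfast", "brush teeth"], clusters))

In this toy example the model covers two of the three clusters, so it receives (25 + 40) / 85 of the available credit; a human-level answer list would cover all prototypical clusters.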
