ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning

Given a question about a prototypical situation -- such as "Name something that people usually do before they leave the house for work." -- humans can easily answer it from acquired experience. Such questions can have multiple correct answers, some more common for the situation than others. This paper introduces a new question answering dataset for training and evaluating the common-sense reasoning capabilities of artificial intelligence systems in such prototypical situations. The training set is gathered from an existing set of questions played on a long-running international game show -- Family Feud. The hidden evaluation set is created by collecting answers to each question from 100 crowd workers. We also propose an open-domain task in which a model must output a ranked list of answers, ideally covering all prototypical answers for a question. Evaluating several competitive state-of-the-art models on this dataset, we find a significant gap between the best model and human performance on a number of evaluation metrics.
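
As a rough illustration of how such a ranked-list task can be scored, the sketch below matches a model's ranked answers against answer clusters built from crowd-sourced responses and credits each cluster at most once via an optimal one-to-one assignment. The cluster data, the exact-string matching rule, the answer cap, and the function name are illustrative assumptions, not the paper's official evaluation protocol.

# Minimal sketch (assumed setup, not the official ProtoQA scorer):
# score a ranked list of model answers against crowd answer clusters.
import numpy as np
from scipy.optimize import linear_sum_assignment

def score_ranked_answers(predicted, clusters, max_answers=10):
    """predicted: ranked list of answer strings from a model.
    clusters: list of (answer_set, count) pairs built from crowd answers,
    where count is how many of the ~100 workers gave an answer in that set."""
    predicted = predicted[:max_answers]          # cap the ranked list length
    total = sum(count for _, count in clusters)  # best achievable credit

    # Reward matrix: a prediction earns a cluster's count if it matches
    # any surface form in that cluster (exact match here for simplicity;
    # softer matching could be substituted).
    reward = np.zeros((len(predicted), len(clusters)))
    for i, pred in enumerate(predicted):
        for j, (answers, count) in enumerate(clusters):
            if pred.strip().lower() in answers:
                reward[i, j] = count

    # Each prediction is credited to at most one cluster and vice versa;
    # an optimal assignment maximizes the total credit earned.
    rows, cols = linear_sum_assignment(reward, maximize=True)
    return reward[rows, cols].sum() / total

# Example: crowd answers grouped into clusters with their frequencies.
clusters = [({"eat breakfast", "have breakfast"}, 40),
            ({"shower", "take a shower"}, 25),
            ({"get dressed"}, 20)]
print(score_ranked_answers(["shower", "eat breakfast", "brush teeth"], clusters))

In this toy example the model covers two of the three clusters, so it receives (25 + 40) / 85 of the available credit; a human-level answer list would cover all prototypical clusters.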
