Explanations for CommonsenseQA: New Dataset and Models

The CommonsenseQA (CQA) dataset (Talmor et al., 2019) was recently released to advance research on the common-sense question answering (QA) task. Whereas prior work has mostly focused on proposing QA models for this dataset, our aim is to retrieve as well as generate explanations for a given (question, correct answer choice, incorrect answer choices) tuple from this dataset. Our definition of an explanation is based on certain desiderata, and translates an explanation into a set of positive and negative common-sense properties (i.e., facts) that not only explain the correct answer choice but also refute the incorrect ones. We human-annotate a first-of-its-kind dataset (called ECQA) of positive and negative properties, as well as free-flow explanations, for 11K QA pairs taken from the CQA dataset. We propose a latent-representation-based property retrieval model as well as a GPT-2 based property generation model with a novel two-step fine-tuning procedure. We also propose a free-flow explanation generation model. Extensive experiments show that our retrieval model beats the BM25 baseline by a relative gain of 100% in F1 score, our property generation model achieves a respectable F1 score of 36.4, and our free-flow generation model achieves a similarity score of 61.9, where the last two scores are based on a human-correlated semantic similarity metric.
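The abstract does not spell out any of the models, but the BM25 baseline that the retrieval model is compared against is a well-documented ranking function (see [30]). The sketch below is a minimal, self-contained Okapi BM25 ranker over candidate property sentences; the `BM25` class, the toy property corpus, and whitespace tokenization are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

class BM25:
    """Minimal Okapi BM25 ranker (Robertson & Zaragoza, 2009)."""

    def __init__(self, corpus, k1=1.5, b=0.75):
        # corpus: one token list per candidate property sentence
        self.corpus = corpus
        self.k1, self.b = k1, b
        self.doc_len = [len(doc) for doc in corpus]
        self.avgdl = sum(self.doc_len) / len(corpus)
        self.term_freqs = [Counter(doc) for doc in corpus]
        df = Counter()  # document frequency of each term
        for doc in corpus:
            df.update(set(doc))
        n = len(corpus)
        # Smoothed IDF; the +1 keeps it non-negative for very common terms.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1)
                    for t, f in df.items()}

    def score(self, query, idx):
        """BM25 score of candidate `idx` for a tokenized query."""
        freqs, dl = self.term_freqs[idx], self.doc_len[idx]
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        return sum(self.idf[t] * freqs[t] * (self.k1 + 1) / (freqs[t] + norm)
                   for t in query if t in freqs)

    def rank(self, query, top_k=3):
        """Return the top_k (score, index) pairs, best first."""
        scores = [(self.score(query, i), i) for i in range(len(self.corpus))]
        return sorted(scores, reverse=True)[:top_k]

# Toy example: rank hypothetical common-sense properties for a CQA-style query.
properties = [
    "a bank is a building where people deposit and withdraw money",
    "a river bank is the sloped ground beside a river",
    "money is used to buy goods and services",
]
bm25 = BM25([p.split() for p in properties])
query = "where do people deposit money".split()
print(bm25.rank(query, top_k=2))  # the financial-bank property should score highest
```

A purely lexical scorer like this misses properties that paraphrase the question, which is the gap the paper's latent-representation retrieval model targets and the source of the reported 100% relative F1 gain over BM25.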

[1] Eric Nyberg, et al. Towards Generalizable Neuro-Symbolic Systems for Commonsense Question Answering, 2019, EMNLP.

[2] Quoc V. Le, et al. Neural Symbolic Reader: Scalable Integration of Distributed and Symbolic Representations for Reading Comprehension, 2020, ICLR.

[3] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[4] Peter Clark, et al. Transformers as Soft Reasoners over Language, 2020, arXiv.

[5] Yejin Choi, et al. MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms, 2019, NAACL.

[6] Eneko Agirre, et al. SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity, 2012, SemEval.

[7] Richard Socher, et al. Explain Yourself! Leveraging Language Models for Commonsense Reasoning, 2019, ACL.

[8] Yoshua Bengio, et al. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering, 2018, EMNLP.

[9] Nelson F. Liu, et al. Crowdsourcing Multiple Choice Science Questions, 2017, W-NUT@EMNLP.

[10] A. Feinstein, et al. High Agreement but Low Kappa: I. The Problems of Two Paradoxes, 1990, Journal of Clinical Epidemiology.

[11] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization.

[12] Harold W. Kuhn. The Hungarian Method for the Assignment Problem, 1955, Naval Research Logistics Quarterly.

[13] Matthew Henderson, et al. Efficient Natural Language Response Suggestion for Smart Reply, 2017, arXiv.

[14] Ronan Le Bras, et al. Generative Data Augmentation for Commonsense Reasoning, 2020, Findings of EMNLP.

[15] Keh-Yih Su, et al. A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers, 2020, ACL.

[16] Shalini Ghosh, et al. Generating Natural Language Explanations for Visual Question Answering using Scene Graphs and Visual Attention, 2019, arXiv.

[17] Jonathan Berant, et al. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, 2019, NAACL.

[18] Andrew Chou, et al. Semantic Parsing on Freebase from Question-Answer Pairs, 2013, EMNLP.

[19] Hannaneh Hajishirzi, et al. UnifiedQA: Crossing Format Boundaries With a Single QA System, 2020, Findings of EMNLP.

[20] Hai Zhao, et al. Retrospective Reader for Machine Reading Comprehension, 2020, AAAI.

[21] C. Lawrence Zitnick, et al. CIDEr: Consensus-based Image Description Evaluation, 2015, CVPR.

[22] Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, Text Summarization Branches Out (ACL Workshop).

[23] Doug Downey, et al. Abductive Commonsense Reasoning, 2019, ICLR.

[24] Basura Fernando, et al. SPICE: Semantic Propositional Image Caption Evaluation, 2016, ECCV.

[25] Jian Zhang, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text, 2016, EMNLP.

[26] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.

[27] Helmut Horacek. Requirements for Conceptual Representations of Explanations and How Reasoning Systems Can Serve Them, 2017.

[28] Francesca Toni, et al. Explainable Automated Fact-Checking for Public Health Claims, 2020, EMNLP.

[29] Mihai Surdeanu, et al. Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering, 2020, ACL.

[30] Hugo Zaragoza, et al. The Probabilistic Relevance Framework: BM25 and Beyond, 2009, Foundations and Trends in Information Retrieval.

[31] Jens Lehmann, et al. LC-QuAD: A Corpus for Complex Question Answering over Knowledge Graphs, 2017, ISWC.

[32] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[33] Eunsol Choi, et al. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension, 2017, ACL.

[34] Peter Clark, et al. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering, 2018, EMNLP.

[35] Ming-Wei Chang, et al. Natural Questions: A Benchmark for Question Answering Research, 2019, TACL.

[36] Tushar Khot, et al. QASC: A Dataset for Question Answering via Sentence Composition, 2020, AAAI.

[37] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[38] Elena Cabrio, et al. Question Answering over Linked Data (QALD-5), 2014, CLEF.

[39] Yue Zhang, et al. Does It Make Sense? And Why? A Pilot Study for Sense Making and Explanation, 2019, ACL.

[40] L. Venkata Subramaniam, et al. Translucent Answer Predictions in Multi-Hop Reading Comprehension, 2020, AAAI.

[41] Peter A. Jansen, et al. WorldTree V2: A Corpus of Science-Domain Structured Explanations and Inference Patterns Supporting Multi-Hop Inference, 2020, LREC.

[42] Clayton T. Morrison, et al. WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions Supporting Multi-hop Inference, 2018, LREC.

[43] Wang Ling, et al. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems, 2017, ACL.

[44] Miguel Angel Ríos Gaona. Methods for Measuring Semantic Similarity of Texts, 2014.

[45] Yu Cheng, et al. FreeLB: Enhanced Adversarial Training for Natural Language Understanding, 2020, ICLR.