Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements

Despite the much-discussed capabilities of today's language models, they are still prone to silly and unexpected commonsense failures. We consider a retrospective verification approach that reflects on the correctness of LM outputs, and introduce Vera, a general-purpose model that estimates the plausibility of declarative statements based on commonsense knowledge. Trained on ~7M commonsense statements created from 19 QA datasets and two large-scale knowledge bases, and with a combination of three training objectives, Vera is a versatile model that effectively separates correct from incorrect statements across diverse commonsense domains. When applied to solving commonsense problems in the verification format, Vera substantially outperforms existing models that can be repurposed for commonsense verification; it further generalizes to unseen tasks and provides well-calibrated outputs. We find that Vera excels at filtering LM-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.
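To make the "verification format" concrete, the sketch below (Python) shows how a plausibility scorer like Vera could be used to answer a multiple-choice commonsense question and to filter model-generated knowledge. The `score_plausibility` callable and the 0.5 threshold are illustrative assumptions standing in for the released model and its calibration, not the actual API.

```python
from typing import Callable, List, Tuple

# Hypothetical scorer: maps a declarative statement to a plausibility in [0, 1].
# It stands in for Vera; the real model interface is not specified here.
ScoreFn = Callable[[str], float]

def answer_by_verification(candidate_statements: List[str], score: ScoreFn) -> Tuple[int, float]:
    """Each answer choice is first converted into a declarative statement;
    the verifier then picks the statement it finds most plausible."""
    scores = [score(s) for s in candidate_statements]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]

def filter_generated_knowledge(statements: List[str], score: ScoreFn, threshold: float = 0.5) -> List[str]:
    """Keep only generated commonsense statements the verifier deems plausible.
    The 0.5 cutoff is an assumption; a well-calibrated scorer lets it be tuned."""
    return [s for s in statements if score(s) >= threshold]

if __name__ == "__main__":
    # Toy stand-in scorer, for demonstration only.
    toy_score: ScoreFn = lambda s: 0.1 if "four legs" in s else 0.9
    choices = ["Birds have four legs.", "Birds have two legs."]
    idx, conf = answer_by_verification(choices, toy_score)
    print(f"Predicted: {choices[idx]!r} (plausibility {conf:.2f})")
```

The same scoring interface supports both uses described above: ranking candidate answers by plausibility, and thresholding to discard implausible generated statements.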
