Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

The Winograd Schema (WS) has been proposed as a test for measuring the commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks, but the source of this improvement remains unclear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method for WS is suboptimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting, to account for the commonsense reasoning abilities acquired during pretraining, and observe that popular language models perform no better than chance in this setting under our stricter evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is unlikely to successfully support all the required commonsense reasoning skills and knowledge.
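To make the proposed twin-based evaluation concrete, here is a minimal sketch (illustrative code, not the authors' implementation; all names are hypothetical): instead of scoring each sentence independently, a model is credited only when it resolves both members of a Winograd twin pair correctly.

```python
# Minimal sketch of twin-pair ("group") scoring for Winograd-style data.
# Illustrative only -- not the paper's code; names are hypothetical.

def per_sentence_accuracy(predictions, golds):
    """Standard WS evaluation: each sentence is scored on its own."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)

def twin_pair_accuracy(predictions, golds, pair_ids):
    """Stricter evaluation: a pair counts only if both twins are correct."""
    pairs = {}
    for pred, gold, pid in zip(predictions, golds, pair_ids):
        pairs.setdefault(pid, []).append(pred == gold)
    return sum(all(hits) for hits in pairs.values()) / len(pairs)

# Two twin pairs (four sentences); the model gets pair 0 right but
# answers both twins of pair 1 the same way, so pair 1 fails.
preds = ["trophy", "suitcase", "council", "council"]
golds = ["trophy", "suitcase", "council", "demonstrators"]
pair_ids = [0, 0, 1, 1]

print(per_sentence_accuracy(preds, golds))         # 0.75
print(twin_pair_accuracy(preds, golds, pair_ids))  # 0.5
```

Because twins differ in only a word or two, a model that relies on artifacts in the candidates tends to give the same answer to both twins, inflating per-sentence accuracy while failing the paired score.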

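The zero-shot evaluation can likewise be sketched with a standard language-model scoring scheme (in the spirit of Trinh and Le's "A Simple Method for Commonsense Reasoning"; this sketch assumes GPT-2 via HuggingFace Transformers and is not necessarily the paper's exact procedure): substitute each candidate for the pronoun and prefer the substitution the model assigns the lower average token negative log-likelihood.

```python
# Sketch of zero-shot Winograd scoring with a pretrained LM.
# Assumes GPT-2 via HuggingFace Transformers; illustrative, not the paper's code.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_nll(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

def resolve(template: str, candidates: list[str]) -> str:
    """Fill the `_` slot with each candidate; return the likelier candidate."""
    return min(candidates, key=lambda c: avg_nll(template.replace("_", c)))

print(resolve(
    "The trophy doesn't fit in the suitcase because the _ is too big.",
    ["trophy", "suitcase"],
))  # expected: "trophy"
```

Combined with the twin-pair scoring above, the same scorer must prefer opposite candidates on the two twins; this is the stricter setting in which the abstract reports near-chance performance.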