What Will it Take to Fix Benchmarking in Natural Language Understanding?

Evaluation for many natural language understanding (NLU) tasks is broken: unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend of abandoning IID benchmarks in favor of adversarially constructed, out-of-distribution test sets ensures that current models will perform poorly, but it ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue that most current benchmarks fail these criteria and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.
