Dynabench: Rethinking Benchmarking in NLP

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
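
The core loop the abstract describes can be pictured, at a high level, as follows. This is a minimal illustrative sketch, not Dynabench's actual API: the names annotate, target_model, and validate are hypothetical placeholders for the annotator interface, the deployed model-in-the-loop, and the second-person validation step.

```python
from typing import Callable, Dict, List

# Minimal illustrative sketch of human-and-model-in-the-loop data collection
# (hypothetical helper names; not Dynabench's actual API).

def collect_dynamic_examples(
    annotate: Callable[[], Dict[str, str]],      # annotator writes {"text": ..., "label": ...}
    target_model: Callable[[str], str],          # in-the-loop model predicts a label for the text
    validate: Callable[[Dict[str, str]], bool],  # a second person confirms the annotator's label
    num_examples: int,
) -> List[Dict[str, str]]:
    collected: List[Dict[str, str]] = []
    while len(collected) < num_examples:
        example = annotate()
        fools_model = target_model(example["text"]) != example["label"]
        # Keep only examples the target model gets wrong but another person agrees on.
        if fools_model and validate(example):
            collected.append(example)
    return collected
```

Examples gathered this way can then inform the training of a stronger model, which in turn becomes the target for a subsequent collection round; this is the sense in which dataset creation, model development, and model assessment directly inform each other.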
