ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds. We propose a fine-grained annotation scheme for the different aspects of inference responsible for the gold classification labels, and use it to hand-code all three ANLI development sets. We use these annotations to answer a variety of interesting questions: which inference types are most common, which models perform best on each reasoning type, and which types are the most challenging for state-of-the-art models? We hope that our annotations will enable more fine-grained evaluation of models trained on ANLI, provide us with a deeper understanding of where models fail and succeed, and help us determine how to train better models in the future.
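As a rough illustration of the kind of fine-grained evaluation the annotations enable, the sketch below computes per-inference-type accuracy from hand-coded dev-set examples. This is not the paper's released code; the field names ("gold", "pred", "types") and the example type labels are illustrative assumptions.

```python
# Minimal sketch (hypothetical data format, not the authors' released code):
# compute model accuracy broken down by annotated inference type.
from collections import defaultdict

def accuracy_by_inference_type(examples):
    """examples: iterable of dicts with a 'gold' label, a model 'pred' label,
    and a list of annotated inference 'types' for each dev-set item."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        for t in ex["types"]:  # a single example may carry several types
            total[t] += 1
            correct[t] += int(ex["pred"] == ex["gold"])
    return {t: correct[t] / total[t] for t in total}

if __name__ == "__main__":
    # Toy examples with made-up annotations, for illustration only.
    dev = [
        {"gold": "contradiction", "pred": "contradiction", "types": ["Numerical"]},
        {"gold": "entailment",    "pred": "neutral",       "types": ["Reference"]},
        {"gold": "neutral",       "pred": "neutral",       "types": ["Numerical"]},
    ]
    for t, acc in sorted(accuracy_by_inference_type(dev).items()):
        print(f"{t}: {acc:.2f}")
```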
