ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds. We propose a fine-grained annotation scheme for the different aspects of inference responsible for the gold classification labels, and use it to hand-code all three ANLI development sets. We use these annotations to answer a variety of interesting questions: which inference types are most common, which models perform best on each reasoning type, and which types are the most challenging for state-of-the-art models? We hope that our annotations will enable more fine-grained evaluation of models trained on ANLI, provide us with a deeper understanding of where models fail and succeed, and help us determine how to train better models in the future.
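As a rough illustration of the kind of fine-grained evaluation the annotations enable, the sketch below computes per-inference-type accuracy from hand-coded dev-set examples. This is not the paper's released code; the field names ("gold", "pred", "types") and the example type labels are illustrative assumptions.

```python
# Minimal sketch (hypothetical data format, not the authors' released code):
# compute model accuracy broken down by annotated inference type.
from collections import defaultdict

def accuracy_by_inference_type(examples):
    """examples: iterable of dicts with a 'gold' label, a model 'pred' label,
    and a list of annotated inference 'types' for each dev-set item."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        for t in ex["types"]:  # a single example may carry several types
            total[t] += 1
            correct[t] += int(ex["pred"] == ex["gold"])
    return {t: correct[t] / total[t] for t in total}

if __name__ == "__main__":
    # Toy examples with made-up annotations, for illustration only.
    dev = [
        {"gold": "contradiction", "pred": "contradiction", "types": ["Numerical"]},
        {"gold": "entailment",    "pred": "neutral",       "types": ["Reference"]},
        {"gold": "neutral",       "pred": "neutral",       "types": ["Numerical"]},
    ]
    for t, acc in sorted(accuracy_by_inference_type(dev).items()):
        print(f"{t}: {acc:.2f}")
```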
