The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How does the instability affect the reliability of conclusions drawn from these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability, and what are potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find that, in addition to unstable final performance, the instability persists throughout the training curve. We also observe lower-than-expected correlations between the analysis validation set and the standard validation set, calling into question the effectiveness of the current model-selection routine. Next, to answer the second question, we provide both a theoretical explanation and empirical evidence for the source of the instability, showing that it mainly comes from high inter-example correlations within analysis sets. Finally, for the third question, we discuss an initial attempt to mitigate the instability and suggest guidelines for future work, such as reporting the decomposed variance for more interpretable results and fairer comparisons across models. Our code is publicly available at: this https URL
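The abstract mentions inter-example correlations and decomposed variance without showing how they are computed. Below is a minimal sketch, not the authors' released code, of how both quantities could be estimated from repeated training runs, assuming a hypothetical 0/1 correctness matrix `correct` of shape (num_runs, num_examples) recorded on an analysis set:

```python
# Minimal sketch (assumptions: a hypothetical `correct` matrix, not the paper's code)
# of (a) decomposing the variance of analysis-set accuracy across training runs into
# an independent per-example term and an inter-example covariance term, and
# (b) the mean pairwise correlation of per-example correctness across runs.

import numpy as np

def decompose_accuracy_variance(correct: np.ndarray):
    """Split Var(accuracy) over runs into per-example variance and inter-example covariance."""
    num_runs, num_examples = correct.shape
    acc_per_run = correct.mean(axis=1)                  # accuracy of each run (seed)
    total_var = acc_per_run.var(ddof=0)                 # the usually reported "instability"

    cov = np.cov(correct, rowvar=False, ddof=0)         # (num_examples, num_examples) covariance over runs
    independent_var = np.trace(cov) / num_examples**2   # contribution if examples were independent
    covariance_var = (cov.sum() - np.trace(cov)) / num_examples**2  # contribution from correlated examples

    # total_var == independent_var + covariance_var (up to floating-point error)
    return total_var, independent_var, covariance_var

def mean_inter_example_correlation(correct: np.ndarray) -> float:
    """Average pairwise correlation of correctness across examples (off-diagonal mean)."""
    corr = np.corrcoef(correct, rowvar=False)           # examples that are always right/wrong give NaN
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.nanmean(off_diag))                  # NaN entries are skipped

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 10 runs x 200 analysis-set examples with artificially correlated errors.
    shared = rng.random((10, 1)) < 0.5
    correct = ((rng.random((10, 200)) < 0.7) & shared).astype(int)
    print(decompose_accuracy_variance(correct))
    print(mean_inter_example_correlation(correct))
```

In this decomposition, the covariance term captures the effect the abstract attributes the instability to: when examples in an analysis set are highly correlated, their errors move together across random seeds, inflating the variance of the aggregate accuracy well beyond what independent examples would produce. Whether this matches the paper's exact decomposition should be checked against the released code.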
