With Little Power Comes Great Responsibility

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state-of-the-art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
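As a concrete illustration of the kind of prospective power analysis the abstract advocates, the sketch below estimates power by simulation for a paired comparison of two classifiers on a shared test set, using a two-sided exact sign test on the examples where the models disagree (the test underlying McNemar-style comparisons). This is a minimal sketch, not the paper's released notebooks: the function name `mcnemar_sign_test_power` and all parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def mcnemar_sign_test_power(n_test, p_only_a, p_only_b,
                            alpha=0.05, n_sims=10_000, seed=0):
    """Estimate, by simulation, the power of a two-sided exact sign test
    for comparing two classifiers evaluated on the same test set.

    n_test    -- number of test examples
    p_only_a  -- P(an example is answered correctly by model A but not B)
    p_only_b  -- P(an example is answered correctly by model B but not A)

    Examples on which the two models agree carry no information for the
    sign test, so only the two disagreement probabilities matter.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Draw counts of (A-only correct, B-only correct, agreement).
        n_a, n_b, _ = rng.multinomial(
            n_test, [p_only_a, p_only_b, 1.0 - p_only_a - p_only_b])
        n_disagree = n_a + n_b
        if n_disagree == 0:
            continue  # no disagreements: the test cannot reject
        # Under H0, each disagreement favors either model with prob 0.5.
        if stats.binomtest(int(n_b), int(n_disagree), 0.5).pvalue < alpha:
            rejections += 1
    return rejections / n_sims
```

As a hypothetical usage example: if model B beats model A by one point of accuracy on a 1,000-example test set and the two models disagree on 10% of examples (`p_only_a=0.045`, `p_only_b=0.055`), `mcnemar_sign_test_power(1000, 0.045, 0.055)` returns an estimate far below the conventional 80% power target, mirroring the pattern the abstract describes for small GLUE test sets.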
