How to Tell When a Result Will Replicate: Significance and Replication in Distributional Null Hypothesis Tests

There is a well-known problem in Null Hypothesis Significance Testing: many statistically significant results fail to replicate in subsequent experiments. We show that this problem arises because standard ‘point-form null’ significance tests consider only within-experiment variation and ignore between-experiment variation, and so systematically underestimate the degree of random variation in results. We give an extension to standard significance testing that addresses this problem by analysing both within- and between-experiment variation. This ‘distributional null’ approach does not underestimate experimental variability and so is not overconfident in identifying significance; because it addresses between-experiment variation, it gives mathematically coherent estimates of the probability that a significant result will replicate. Using a large-scale replication dataset (the first ‘Many Labs’ project), we show that many experimental results that appear statistically significant in standard tests are in fact consistent with random variation when both within- and between-experiment variation are taken into account. Further, grouping experiments in this dataset into ‘predictor-target’ pairs, we show that the replication probabilities this approach predicts for target experiments (given the predictor experiment’s results and the sample sizes of the two experiments) are strongly correlated with observed replication rates. Distributional null hypothesis testing thus gives researchers a statistical tool for identifying results that are both statistically significant and reliably replicable.
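
The core idea can be illustrated with a short numerical sketch. The Python code below is a minimal illustration under stated assumptions, not the paper’s actual procedure: it assumes a normal model in which each experiment’s true effect deviates from a common mean with an assumed between-experiment standard deviation (here called `sigma_between`), tests an observed effect against a point-form versus a distributional null, and estimates the probability that a ‘target’ experiment yields a same-direction significant result given a ‘predictor’ result. All function names, the specific variance decomposition, and the numbers in the examples are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def distributional_null_z(d_hat, se_within, sigma_between):
    """Two-sided z-test of an observed effect against a distributional null.

    Under a point-form null the true effect is exactly 0 and the only noise
    is within-experiment sampling error (se_within). Under a distributional
    null, any one experiment's effect is itself drawn from a zero-mean
    distribution with standard deviation sigma_between, so the null variance
    is the sum of both components. Setting sigma_between = 0 recovers the
    standard point-form test.
    """
    se_total = np.sqrt(se_within**2 + sigma_between**2)
    z = d_hat / se_total
    p = 2 * stats.norm.sf(abs(z))  # two-sided p-value
    return z, p


def replication_probability(d1, se1, se2, sigma_between, alpha=0.05):
    """Probability that a target experiment replicates a predictor result.

    'Replicates' here means: the target's own point-form test is significant
    at `alpha` with an effect in the same direction as the predictor. The
    target's observed effect is modelled as normal around the predictor's
    observed effect, with variance from both experiments' sampling error
    plus between-experiment variation in each experiment's true effect.
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    pred_sd = np.sqrt(se1**2 + se2**2 + 2 * sigma_between**2)
    if d1 >= 0:
        # Target must land above +z_crit * se2: significant and positive.
        return stats.norm.sf(z_crit * se2, loc=d1, scale=pred_sd)
    # Target must land below -z_crit * se2: significant and negative.
    return stats.norm.cdf(-z_crit * se2, loc=d1, scale=pred_sd)


# An effect of 0.25 with within-experiment SE 0.10 is significant under a
# point-form null (sigma_between = 0) but not under a distributional null
# allowing between-experiment SD 0.10 in the null effect.
print(distributional_null_z(0.25, se_within=0.10, sigma_between=0.0))
print(distributional_null_z(0.25, se_within=0.10, sigma_between=0.10))

# Predicted chance that a target study (SE 0.12) gives a same-direction
# significant result, given a predictor effect of 0.30 (SE 0.10).
print(replication_probability(d1=0.30, se1=0.10, se2=0.12, sigma_between=0.10))
```

In this toy model, widening the null from a point to a distribution turns a nominally significant result (p ≈ 0.012) into a non-significant one (p ≈ 0.08), and the same between-experiment variance term feeds directly into a replication probability for a paired target experiment, mirroring the two uses of the approach described above.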
