The reproducibility of research and the misinterpretation of p-values

We wish to answer this question: if you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive? The weak evidence provided by p-values between 0.01 and 0.05 is explored by exact calculations of false positive risks. When you observe p = 0.05, the odds in favour of there being a real effect (given by the likelihood ratio) are about 3 : 1. This is far weaker evidence than the 19 : 1 odds that might, wrongly, be inferred from the p-value. And if you want to limit the false positive risk to 5%, you would have to assume a prior probability of 0.87 that there was a real effect before the experiment was done.

If you observe p = 0.001 in a well-powered experiment, the likelihood ratio is almost 100 : 1 in favour of there being a real effect. That would usually be regarded as conclusive, yet the false positive risk would still be 8% if the prior probability of a real effect were only 0.1. In this case, to achieve a false positive risk of 5% you would need to observe p = 0.00045.

It is recommended that the terms ‘significant’ and ‘non-significant’ should never be used. Rather, p-values should be supplemented by specifying the prior probability that would be needed to produce a specified (e.g. 5%) false positive risk. It may also be helpful to specify the minimum false positive risk associated with the observed p-value.

Despite decades of warnings, many areas of science still insist on labelling a result of p < 0.05 as ‘statistically significant’. This practice must contribute to the lack of reproducibility in some areas of science, and that is before you get to the many other well-known problems, like multiple comparisons, lack of randomization and p-hacking. Precise inductive inference is impossible and replication is the only way to be sure. Science is endangered by statistical misunderstanding, and by senior people who impose perverse incentives on scientists.
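
The figures above follow from Bayes’ rule applied on the odds scale. As a minimal sketch, the Python below takes the approximate likelihood ratios quoted in the text (about 3 at p = 0.05 and about 100 at p = 0.001 in a well-powered experiment) as given, rather than deriving them from the t-distribution as the full paper does, and converts them into false positive risks and required priors.

# Sketch of the Bayes odds arithmetic described above. The likelihood ratios
# are the approximate values quoted in the text; they are assumed here, not
# derived from the observed data.

def false_positive_risk(likelihood_ratio, prior):
    """P(no real effect | 'significant' result), given the prior and the LR."""
    prior_odds = prior / (1.0 - prior)              # odds of a real effect before the experiment
    posterior_odds = likelihood_ratio * prior_odds  # odds of a real effect after the experiment
    return 1.0 / (1.0 + posterior_odds)

def prior_needed(likelihood_ratio, target_fpr=0.05):
    """Prior probability of a real effect needed to keep the FPR at target_fpr."""
    required_posterior_odds = (1.0 - target_fpr) / target_fpr  # e.g. 19 : 1 for a 5% FPR
    prior_odds = required_posterior_odds / likelihood_ratio
    return prior_odds / (1.0 + prior_odds)

if __name__ == "__main__":
    print(false_positive_risk(3, 0.5))    # p = 0.05, 50:50 prior: FPR = 0.25 (about 1 in 4)
    print(false_positive_risk(100, 0.1))  # p = 0.001, prior 0.1: FPR ~ 0.08
    print(prior_needed(3, 0.05))          # prior of ~0.86 needed for a 5% FPR at p = 0.05

Because the likelihood ratios are rounded, the sketch reproduces the quoted figures only approximately (for example 0.86 rather than 0.87 for the prior needed at p = 0.05); the exact values depend on the likelihood ratio computed from the observed data.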
