The fickle P value generates irreproducible results

The reliability and reproducibility of science are under scrutiny. However, a major cause of this lack of repeatability is not being considered: the wide sample-to-sample variability in the P value. We explain why P is fickle to discourage the ill-informed practice of interpreting analyses based predominantly on this statistic.
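The claim concerns the sampling behaviour of P itself. As a minimal sketch (not code from the paper, and with an arbitrary sample size and effect size chosen purely for illustration), the Python simulation below repeats an identical two-group experiment many times and reports how widely the resulting P values are spread, even though every replicate is drawn from the same populations with the same true effect.

```python
# Illustrative simulation (assumed parameters, not from the paper):
# how much does P vary across exact replicates of one experiment?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_per_group = 10       # assumed sample size per group
true_diff = 0.5        # assumed true difference between means (in SD units)
n_experiments = 1000   # number of identical replicate experiments

p_values = []
for _ in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)          # control group
    b = rng.normal(true_diff, 1.0, n_per_group)    # treatment group
    p_values.append(stats.ttest_ind(a, b).pvalue)  # two-sample t-test

p_values = np.array(p_values)
print(f"median P = {np.median(p_values):.3f}")
print(f"P spread (2.5th to 97.5th percentile) = "
      f"{np.percentile(p_values, 2.5):.4f} to {np.percentile(p_values, 97.5):.3f}")
print(f"fraction of replicates with P < 0.05 = {np.mean(p_values < 0.05):.2f}")
```

Under these assumed settings the printed percentile range typically spans several orders of magnitude and only a minority of replicates reach P < 0.05, which is the sample-to-sample fickleness the abstract describes.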
