What Have We (Not) Learnt from Millions of Scientific Papers with P Values?

ABSTRACT P values linked to null hypothesis significance testing (NHST) are the most widely (mis)used method of statistical inference. Empirical data suggest that, across the biomedical literature (1990–2015), when abstracts use P values, 96% of them include P values of 0.05 or less. The same percentage (96%) applies to full-text articles. Among 100 articles in PubMed, 55 report P values, while only 4 present confidence intervals for all the reported effect sizes, none use Bayesian methods, and none use false-discovery rates. Over 25 years (1990–2015), the use of P values in abstracts has doubled across all of PubMed and tripled for meta-analyses, while for some types of designs, such as randomized trials, the majority of abstracts report P values. There is major selective reporting of P values: abstracts tend to highlight the most favorable P values, and inferences apply even further spin to reach exaggerated, unreliable conclusions. The availability of large-scale data on P values from many papers has allowed the development and application of methods that try to detect and model selection biases, for example p-hacking, that cause patterns of excess significance. Inferences from these methods need to be cautious, as they depend on the assumptions made by the models and can be affected by the presence of other biases (e.g., confounding in observational studies). While much of the unreliability of past and present research is driven by small, underpowered studies, NHST with P values may also be particularly problematic in the era of overpowered big data. NHST and P values are optimal only in a minority of current research. Using a more stringent threshold, as in the recently proposed shift from P < 0.05 to P < 0.005, is a temporizing measure to contain the flood and death-by-significance. NHST and P values may be replaced in many fields by other, more fit-for-purpose inferential methods. However, curtailing selection biases requires additional measures beyond changes in inferential methods, in particular reproducible research practices.
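To make the excess-significance idea mentioned above concrete, the sketch below illustrates one crude p-curve-style diagnostic; it is not the method used in the studies summarized here, and the list of P values is hypothetical. The logic: under the null of no true effects and no selective reporting, statistically significant P values should be roughly uniform on (0, 0.05), so about half should fall below 0.025; a pile-up just under 0.05 is one signature consistent with p-hacking, whereas a strong left skew suggests genuine evidential value.

```python
from scipy.stats import binomtest  # requires SciPy >= 1.7

# Hypothetical "significant" P values (< .05) harvested from a set of
# abstracts; in a real analysis these would come from large-scale text mining.
reported_p = [0.003, 0.012, 0.021, 0.030, 0.041, 0.044, 0.048, 0.049]

# Under uniformity on (0, .05), roughly half of these should lie below .025.
n_small = sum(p < 0.025 for p in reported_p)

# One-sided binomial test: is the proportion below .025 smaller than expected,
# i.e., are the significant P values piled up just under the .05 threshold?
result = binomtest(n_small, n=len(reported_p), p=0.5, alternative="less")
print(f"{n_small}/{len(reported_p)} significant P values below .025; "
      f"one-sided binomial P = {result.pvalue:.3f}")
```

Real excess-significance and p-curve analyses use more refined tests (e.g., combining the conditional pp-values across studies), but the same selection-bias logic applies.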
