The ongoing tyranny of statistical significance testing in biomedical research

Since its introduction into the biomedical literature, statistical significance testing (SST) has caused much debate. The aim of this perspective article is to review frequent fallacies and misuses of SST in the biomedical field and to discuss a potential way out of them. Two frequentist schools of statistical inference, the Fisher school and the Neyman-Pearson school, merged to form SST as it is practised today. The P-value is both reported quantitatively and checked against the α-level to produce a qualitative, dichotomous measure (significant/nonsignificant). However, a P-value mixes the estimated effect size with its estimated precision, and a single number cannot measure both. For SSTs to be interpreted validly, a variety of assumptions and requirements have to be met. We point here to four of them: study size, correct statistical model, correct causal model, and absence of bias and confounding. It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST remains highly prevalent in the biomedical literature even after decades of warnings against it. The ubiquitous misuse and tyranny of SST threatens scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may even harm patients who are eventually treated incorrectly because of improper handling of P-values. For a proper interpretation of study results, both the estimated effect size and its estimated precision are necessary ingredients.
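The point that a P-value mixes effect size with precision can be illustrated with a minimal sketch. The two studies below are hypothetical, their estimates and standard errors are invented for illustration, and a simple Wald (normal-approximation) test is assumed. Both yield the same P-value, although one estimate is ten times larger and ten times less precise than the other; only the estimate together with its confidence interval tells them apart.

```python
import math

def wald_summary(estimate, std_error, z_crit=1.959964):
    """Two-sided Wald P-value and 95% confidence interval.

    Assumes the estimate is approximately normally distributed;
    z_crit is the 97.5th percentile of the standard normal.
    """
    z = estimate / std_error
    # Two-sided P-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    half_width = z_crit * std_error
    return {
        "z": z,
        "p": p_two_sided,
        "ci": (estimate - half_width, estimate + half_width),
    }

# Hypothetical study A: small effect, estimated precisely.
study_a = wald_summary(estimate=0.2, std_error=0.1)
# Hypothetical study B: large effect, estimated imprecisely.
study_b = wald_summary(estimate=2.0, std_error=1.0)

for name, s in (("A", study_a), ("B", study_b)):
    low, high = s["ci"]
    print(f"Study {name}: P = {s['p']:.4f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Reporting only "P < 0.05" would treat the two studies as equivalent, whereas the interval widths show how differently they constrain the effect.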
