Sifting the evidence—what's wrong with significance tests?

The findings of medical research are often met with considerable scepticism, even when they have apparently come from studies with sound methodologies that have been subjected to appropriate statistical analysis. This is perhaps particularly the case with respect to epidemiological findings that suggest that some aspect of everyday life is bad for people. Indeed, one recent popular history, the medical journalist James Le Fanu's *The Rise and Fall of Modern Medicine*, went so far as to suggest that the solution to medicine's ills would be the closure of all departments of epidemiology.1

One contributory factor is that the medical literature shows a strong tendency to accentuate the positive: positive outcomes are more likely to be reported than null results.2–4 By this means alone a host of purely chance findings will be published, as by conventional reasoning examining 20 associations will produce one result that is "significant at P=0.05" by chance alone. If only positive findings are published, they may be mistakenly considered important rather than recognised as the chance results that criteria of meaningfulness based on statistical significance necessarily produce. As many studies collect information on hundreds of variables through long questionnaires, and measure a wide range of potential outcomes, several false positive findings are virtually guaranteed.

The high volume and often contradictory nature5 of medical research findings, however, is not due to publication bias alone. A more fundamental problem is the widespread misunderstanding of the nature of statistical significance.

#### Summary points

P values, or significance levels, measure the strength of the evidence against the null hypothesis; the smaller the P value, the stronger the evidence against the null hypothesis

An arbitrary division of results, into "significant" or "non-significant" according to the P value, was not the intention of the …
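The arithmetic behind the "1 in 20" claim above can be sketched in a few lines. This is an illustration, not code from the article, and it assumes the 20 tests are independent:

```python
# Multiple testing under the null: if 20 true null hypotheses are each
# tested at the conventional threshold P = 0.05, how many "significant"
# results should we expect by chance alone?
alpha = 0.05
n_tests = 20

# Expected number of false positives among the 20 tests.
expected_false_positives = alpha * n_tests  # = 1.0, i.e. one per 20 tests

# Probability of at least one "significant" result,
# assuming the tests are independent.
prob_at_least_one = 1 - (1 - alpha) ** n_tests  # roughly 0.64

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one 'significant' result): {prob_at_least_one:.2f}")
```

So even before any real effect is examined, a questionnaire-based study testing dozens of associations is more likely than not to yield at least one result that clears the P=0.05 bar purely by chance.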

[1] K. Dickersin, et al. Factors influencing publication of research results: follow-up of applications submitted to two institutional review boards. 1992, JAMA.

[2] S. Goodman, et al. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. 1993, American Journal of Epidemiology.

[3] J. Morris, et al. Uses of epidemiology. 1955, British Medical Journal.

[4] P. Easterbrook, et al. Publication bias in clinical research. 1991, The Lancet.

[5] M. Gardner, et al. Statistical guidelines for contributors to medical journals. 1983, British Medical Journal.

[6] A. W. Kemp, et al. Medical uses of statistics. 1994.

[7] T. C. Chalmers, et al. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 "negative" trials. 1978, The New England Journal of Medicine.

[8] W. Browner, et al. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. 1987, JAMA.

[9] M. Smithson. Statistics with confidence. 2000.

[10] T. A. Louis, et al. An assessment of publication bias using a sample of published clinical trials. 1989.

[11] E. S. Pearson, et al. On the problem of the most efficient tests of statistical hypotheses. 1933.

[12] J. Berger, et al. Testing a point null hypothesis: the irreconcilability of P values and evidence. 1987.

[13] M. Oakes, et al. Statistical inference. 1990.

[14] M. Gardner, et al. Confidence intervals rather than P values: estimation rather than hypothesis testing. 1986, British Medical Journal.

[15] A. Phillips, et al. The design of prospective epidemiological studies: more subjects or better measurements? 1993, Journal of Clinical Epidemiology.

[16] M. J. Campbell, et al. Clinical significance not statistical significance: a simple Bayesian alternative to p values. 1998, Journal of Epidemiology and Community Health.

[17] P. Hopkins, et al. Identification and relative weight of cardiovascular risk factors. 1986, Cardiology Clinics.

[18] D. Moher, et al. Statistical power, sample size, and their reporting in randomized controlled trials. 1994, JAMA.

[19] F. Yates. The influence of Statistical Methods for Research Workers on the development of the science of statistics. 1951.

[20] E. Lehmann. The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? 1993.

[21] J. Berkson. Tests of significance considered as evidence. 2003.

[22] W. W. Rozeboom. The fallacy of the null-hypothesis significance test. 1960, Psychological Bulletin.

[23] A. R. Feinstein, et al. P-values and confidence intervals: two sides of the same unsatisfactory coin. 1998, Journal of Clinical Epidemiology.

[24] A. R. Feinstein, et al. A collection of 56 topics with contradictory results in case-control research. 1988, International Journal of Epidemiology.

[25] S. Goodman. Toward evidence-based medical statistics. 1: The P value fallacy. 1999, Annals of Internal Medicine.

[26] A. Phillips, et al. Confounding in epidemiological studies: why "independent" effects may not be all they seem. 1992, BMJ.

[27] J. Le Fanu, et al. The Rise and Fall of Modern Medicine. 1999.

[28] P. Cole, et al. The hypothesis generating machine. 1993, Epidemiology.

[29] R. J. Lilford, et al. For debate: the statistical basis of public policy: a paradigm shift is overdue. 1996, BMJ.

[30] B. Hetzel, et al. The uses of epidemiology. 1985, The Medical Journal of Australia.

[31] M. S. Bartlett, et al. Statistical methods and scientific inference. 1957.

[32] K. J. Rothman, et al. Significance questing. 1986, Annals of Internal Medicine.

[33] P. Gøtzsche, et al. Sample size of randomized double-blind trials 1976-1991. 1996, Danish Medical Bulletin.

[34] R. Peto, et al. Why do we need some large, simple randomized trials? 1984, Statistics in Medicine.

[35] J. W. Tukey. Statistical Methods for Research Workers. 1930, Nature.

[36] D. G. Altman, et al. Statistics with confidence: confidence intervals and statistical guidelines. 1990.

[37] L. Joseph, et al. Placing trials in context using Bayesian analysis: GUSTO revisited by Reverend Bayes. 1995, JAMA.

[38] S. Goodman. Toward evidence-based medical statistics. 2: The Bayes factor. 1999, Annals of Internal Medicine.

[39] G. Smith, et al. Meta-analysis: potentials and promise. 1997, BMJ.

[40] J. Bond, et al. Detection and surveillance of colorectal cancer: reply. 1990.

[41] R. Fisher. Design of Experiments. 1936.

[42] M. Kendall. Statistical Methods for Research Workers. 1937, Nature.

[43] D. Cox, et al. Statistical significance tests. 1982, British Journal of Clinical Pharmacology.

[44] J. Danesh, et al. Chlamydia pneumoniae IgG titres and coronary heart disease: prospective study and meta-analysis. 2000, BMJ.