Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy

The past decade has seen the rise of evidence-based medicine, a movement that has focused attention on the importance of using clinical studies for empirical demonstration of the efficacy of medical interventions. Increasingly, physicians are being called on to assess such studies to help them make clinical decisions and understand the rationale behind recommended practices. This type of assessment requires an understanding of research methods that until recently was not expected of physicians. These research methods include statistical techniques used to assist in drawing conclusions. However, the methods of statistical inference in current use are not "evidence-based" and have contributed to a widespread misperception: that absent any consideration of biological plausibility and prior evidence, statistical methods can provide a number that by itself reflects the probability of reaching an erroneous conclusion. This belief has damaged the quality of scientific reasoning and discourse, primarily by making it difficult to understand how the strength of the evidence in a particular study can be related to and combined with the strength of other evidence (from other laboratory or clinical studies, scientific reasoning, or clinical experience). The result is many knowledge claims that do not stand the test of time (1, 2).

A pair of articles in this issue examines this problem in some depth and proposes a partial solution. In this article, I explore the historical and logical foundations of the dominant school of medical statistics, sometimes referred to as frequentist statistics, which might be described as error-based. I explicate the logical fallacy at the heart of this system and the reason that it maintains such a tenacious hold on the minds of investigators, policymakers, and journal editors. In the second article (3), I present an evidence-based approach derived from Bayesian statistical methods, an alternative perspective that has been one of the most active areas of biostatistical development during the past 20 years. Bayesian methods have started to make inroads into medical journals; Annals, for example, has included a section on Bayesian data interpretation in its Information for Authors section since 1 July 1997.

The perspective on Bayesian methods offered here will differ somewhat from that in previous presentations in other medical journals. It will focus not on the controversial use of these methods in measuring belief but rather on how they measure the weight of quantitative evidence. We will see how reporting an index called the Bayes factor (which in its simplest form is also called a likelihood ratio) instead of the P value can facilitate the integration of statistical summaries and biological knowledge and lead to a better understanding of the role of scientific judgment in the interpretation of medical research.

An Example of the Problem

A recent randomized, controlled trial of hydrocortisone treatment for the chronic fatigue syndrome showed a treatment effect that neared the threshold for statistical significance, P = 0.06 (4). The discussion section began, "hydrocortisone treatment was associated with an improvement in symptoms... This is the first such study... to demonstrate improvement with a drug treatment of [the chronic fatigue syndrome]" (4).

What is remarkable about this paper is how unremarkable it is. It is typical of many medical research reports in that a conclusion based on the findings is stated at the beginning of the discussion. Later in the discussion, such issues as biological mechanism, effect magnitude, and supporting studies are presented. But a conclusion is stated before the actual discussion, as though it were derived directly from the results, a mere linguistic transformation of P = 0.06. This is a natural consequence of a statistical method that has almost eliminated our ability to distinguish between statistical results and scientific conclusions; we will see how it grows out of the "P value fallacy."
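To make the arithmetic behind such a borderline result concrete, here is a minimal Python sketch. The effect size and standard error are hypothetical stand-ins rather than the trial's actual data, and a normal approximation is assumed.

```python
# Minimal sketch of how a borderline "P = 0.06" arises (hypothetical numbers).
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(x / sqrt(2)))

effect = 6.3  # hypothetical observed improvement on a symptom scale
se = 3.3      # hypothetical standard error of that improvement

z = effect / se                     # standardized treatment effect
p = 2 * (1 - normal_cdf(abs(z)))    # two-sided P value under the null hypothesis
print(f"z = {z:.2f}, P = {p:.2f}")  # -> z = 1.91, P = 0.06
```

Nothing in this calculation licenses the leap from P = 0.06 to "treatment was associated with improvement"; that leap is precisely what the rest of this article examines.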
Philosophical Preliminaries

To begin our exploration of the P value fallacy, we must consider the basic elements of reasoning. The process that we use to link underlying knowledge to the observed world is called inferential reasoning, of which there are two logical types: deductive inference and inductive inference. In deductive inference, we start with a given hypothesis (a statement about how nature works) and predict what we should see if that hypothesis were true. Deduction is objective in the sense that the predictions about what we will see are always true if the hypotheses are true. Its problem is that we cannot use it to expand our knowledge beyond what is in the hypotheses.

Inductive inference goes in the reverse direction: on the basis of what we see, we evaluate which hypothesis is most tenable. The concept of evidence is inductive; it is a measure that reflects back from observations to an underlying truth. The advantage of inductive reasoning is that our conclusions about unobserved states of nature are broader than the observations on which they are based; that is, we use this reasoning to generate new hypotheses and to learn new things. Its drawback is that we cannot be sure that what we conclude about nature is actually true, a conundrum known as the problem of induction (5-7).

From their clinical experience, physicians are acutely aware of the subtle but critical difference between these two perspectives. Enumerating the frequency of symptoms (observations) given the known presence of a disease (hypothesis) is a deductive process and can be done by a medical student with a good medical textbook (Figure 1, top). Much harder is the inductive art of differential diagnosis: specifying the likelihood of different diseases on the basis of a patient's signs, symptoms, and laboratory results. The deductions are more certain and objective but less useful than the inductions.

Figure 1. The parallels between the processes of induction and deduction in medical inference (top) and statistical inference (bottom).

The identical issue arises in statistics. Under the assumption that two treatments are the same (that is, the hypothesis of no difference in efficacy is true), it is easy to calculate deductively the frequency of all possible outcomes that we could observe in a study (Figure 1, bottom). But once we observe a particular outcome, as in the result of a clinical trial, it is not easy to answer the more important inductive question, "How likely is it that the treatments are equivalent?"

In this century, philosophers have grappled with the problem of induction and have tried to solve or evade it in several ways. Karl Popper (8) proposed a philosophy of scientific practice that eliminated formal induction completely and used only the deductive elements of science: the prediction and falsification components. Rudolf Carnap tried the opposite strategy: to make the inductive component as logically secure as the deductive part (9, 10). Both were unsuccessful in producing workable models for how science could be conducted, and their failures showed that there is no methodologic solution to the problem of fallible scientific knowledge.

Determining which underlying truth is most likely on the basis of the data is a problem in inverse probability, or inductive inference, that was solved quantitatively more than 200 years ago by the Reverend Thomas Bayes. He withheld his discovery, now known as Bayes' theorem; it was not divulged until after his death in 1761 (11). Figure 2 shows Bayes' theorem in words.

Figure 2. Bayes' theorem, in words.
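For readers who want the words of Figure 2 as symbols, a standard statement of the theorem follows; the notation (H for the hypothesis, D for the observed data) is mine rather than the figure's, and the odds form anticipates the Bayes factor mentioned in the introduction.

```latex
% Bayes' theorem for a hypothesis H and observed data D (requires amsmath):
\[
  P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}
\]
% The same theorem in odds form, writing \bar{H} for "H is false":
\[
  \frac{P(H \mid D)}{P(\bar{H} \mid D)}
  = \frac{P(H)}{P(\bar{H})} \times \frac{P(D \mid H)}{P(D \mid \bar{H})},
  \qquad \text{posterior odds} = \text{prior odds} \times \text{Bayes factor}.
\]
```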
As a mathematical equation, Bayes' theorem is not controversial; it serves as the foundation for analyzing games of chance and medical screening tests. However, as a model for how we should think scientifically, it is criticized because it requires assigning a prior probability to the truth of an idea, a number whose objective scientific meaning is unclear (7, 10, 12). It is speculated that this requirement may be why Reverend Bayes chose the more dire of the "publish or perish" options. It is also the reason why this approach has been tarred with the "subjective" label and has not generally been used by medical researchers.

Conventional (Frequentist) Statistical Inference

Because of the subjectivity of the prior probabilities used in Bayes' theorem, scientists in the 1920s and 1930s tried to develop alternative approaches to statistical inference that used only deductive probabilities, calculated with mathematical formulas that described (under certain assumptions) the frequency of all possible experimental outcomes if an experiment were repeated many times (10). Methods based on this "frequentist" view of probability included an index to measure the strength of evidence, called the P value, proposed by R.A. Fisher in the 1920s (13), and a method for choosing between hypotheses, called a hypothesis test, developed in the early 1930s by the mathematical statisticians Jerzy Neyman and Egon Pearson (14). These two methods were incompatible, but they have become so intertwined that they are mistakenly regarded as part of a single, coherent approach to statistical inference (6, 15, 16).

The P Value

The P value is defined as the probability, under the assumption of no effect or no difference (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed (Figure 3). Fisher proposed it as an informal index to be used as a measure of discrepancy between the data and the null hypothesis; it was not part of a formal inferential method. Fisher suggested that it be used as part of the fluid, non-quantifiable process of drawing conclusions from observations, a process that included combining the P value in some unspecified way with background information (17).

Figure 3. The bell-shaped curve represents the probability of every possible outcome under the null hypothesis.

It is worth noting one widely prevalent and particularly unfortunate misinterpretation of the P value: many researchers and readers take a P value of 0.05 to mean that the null hypothesis has a probability of only 5%.
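To see how different the outputs of the two methods are, consider the following sketch; the z statistics are hypothetical, not drawn from any study cited here. Fisher's P value grades the discrepancy with the null hypothesis on a continuum, whereas a Neyman-Pearson test reports only whether a threshold fixed in advance was crossed.

```python
# Hypothetical contrast between Fisher's P value and a Neyman-Pearson test.
from math import erf, sqrt

def two_sided_p(z: float) -> float:
    """Two-sided P value for a z statistic under a standard normal null."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

ALPHA = 0.05  # Neyman-Pearson error rate, fixed before the experiment

for z in (1.7, 2.1, 4.5):  # hypothetical test statistics from three studies
    p = two_sided_p(z)
    # Fisher: report P itself as a graded, informal index of discrepancy.
    # Neyman-Pearson: report only the decision; P values differing by
    # orders of magnitude (here ~0.036 and ~0.00001) give the same output.
    decision = "reject H0" if p <= ALPHA else "do not reject H0"
    print(f"z = {z:3.1f}  Fisher: P = {p:.5f}  Neyman-Pearson: {decision}")
```

The intertwining of these two outputs, a graded evidential index and a fixed error-rate decision rule, is where the trouble described in this article begins.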

References

[1] Philosophical Transactions of the Royal Society of London. 1781, The London Medical Journal.

[2] E. S. Pearson, et al. On the Problem of the Most Efficient Tests of Statistical Hypotheses. 1933.

[3] K. Popper, et al. The Logic of Scientific Discovery. 1960.

[4] E. S. Pearson. "Student" as Statistician. 1939.

[5] Joseph Berkson, et al. Tests of significance considered as evidence. 1942.

[6] Rudolf Carnap, et al. Logical Foundations of Probability. 1951.

[7] L. Stein, et al. Probability and the Weighing of Evidence. 1950.

[8] R. A. Fisher, et al. Statistical Methods for Research Workers. 1956.

[9] M. S. Bartlett, et al. Statistical Methods and Scientific Inference. 1957.

[10] M. Kendall. Theoretical Statistics. 1956, Nature.

[11] W. W. Rozeboom. The fallacy of the null-hypothesis significance test. 1960, Psychological Bulletin.

[12] E. S. Pearson. Some Thoughts on Statistical Inference. 1962.

[13] F. J. Anscombe. Sequential Medical Trials. 1963.

[14] Donald Mainland. The significance of "nonsignificance". 1963, Clinical Pharmacology and Therapeutics.

[15] J. Cornfield. Sequential Trials, Sequential Analysis and the Likelihood Principle. 1966.

[16] J. Cornfield. A Bayesian Test of Some Classical Hypotheses, with Applications to Sequential Clinical Trials. 1966.

[17] George A. Barnard, et al. The use of the likelihood function in statistical practice. 1967.

[18] J. Cornfield, et al. On Certain Aspects of Sequential Clinical Trials. 1967.

[19] W. Salmon. The Foundations of Scientific Inference. 1967.

[20] H. L. Le Roy, et al. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. IV. 1969.

[21] J. Cornfield, et al. The Bayesian outlook and its application. 1969, Biometrics.

[22] J. G. Skellam, et al. Models, inference, and strategy. 1969, Biometrics.

[23] B. J. Winer. The Significance Test Controversy: A Reader. 1971.

[24] A. Feinstein, et al. Clinical biostatistics. 1974.

[25] A. W. F. Edwards, et al. A History of Likelihood. 1974.

[26] A. Tversky, et al. Judgment under Uncertainty: Heuristics and Biases. 1974, Science.

[27] Vic Barnett. Comparative Statistical Inference. 1975.

[28] R. Galen, et al. Beyond Normality: The Predictive Value and Efficiency of Medical Diagnoses. 1975.

[29] Ian Hacking. The Emergence of Probability: A Philosophical Study of Early Ideas about Probability, Induction and Statistical Inference. 1979.

[30] Lawrence Sklar. Philosophical Problems of Statistical Inference. 1981.

[31] D. Cox, et al. Statistical significance tests. 1982, British Journal of Clinical Pharmacology.

[32] J. Borak, et al. Errors of intuitive logic among physicians. 1982, Social Science & Medicine.

[33] G. Hayden. Biostatistical trends in Pediatrics: implications for the future. 1983, Pediatrics.

[34] W. Dupont. Sequential stopping rules and sequentially adjusted P values: does one require the other? 1983, Controlled Clinical Trials.

[35] G. A. Diamond, et al. Clinical trials and statistical verdicts: probable grounds for appeal. 1983, Annals of Internal Medicine.

[36] D. Mainland, et al. Statistical ritual in clinical journals: is there a cure?--I. 1984, British Medical Journal.

[37] D. Rubin. Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician. 1984.

[38] J. Siemiatycki, et al. The problem of multiple inference in studies designed to generate hypotheses. 1985, American Journal of Epidemiology.

[39] D. Berry, et al. Interim analyses in clinical trials: classical vs. Bayesian approaches. 1985, Statistics in Medicine.

[40] David S. Salsburg. The Religion of Statistics as Practiced in Medical Journals. 1985.

[41] R. Olshen, et al. Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer. 1985.

[42] Stephen M. Stigler. The History of Statistics: The Measurement of Uncertainty before 1900. 1986.

[43] R. Simon, et al. Confidence intervals for reporting results of clinical trials. 1986, Annals of Internal Medicine.

[44] K. J. Rothman. Significance questing. 1986, Annals of Internal Medicine.

[45] M. Oakes. Statistical Inference: A Commentary for the Social and Behavioural Sciences. 1986.

[46] C. Poole. Beyond the confidence interval. 1987, American Journal of Public Health.

[47] J. Berger, et al. Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence. 1987.

[48] W. Browner, et al. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. 1987, JAMA.

[49] R. Peto. Why do we need systematic overviews of randomized trials? 1987, Statistics in Medicine.

[50] H. Wulff, et al. What do doctors know about statistics? 1987, Statistics in Medicine.

[51] S. Pocock, et al. Statistical problems in the reporting of clinical trials: a survey of three medical journals. 1987, The New England Journal of Medicine.

[52] G. Casella, et al. Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. 1987.

[53] S. Evans, et al. The end of the p value? 1988, British Heart Journal.

[54] James O. Berger, et al. Statistical Analysis and the Illusion of Objectivity. 1988.

[55] S. Goodman, et al. Evidence and scientific research. 1988, American Journal of Public Health.

[56] J. Berger. Statistical Decision Theory and Bayesian Analysis. 1988.

[57] J. C. Bailar, et al. Guidelines for statistical reporting in articles for medical journals: amplifications and explanations. 1988, Annals of Internal Medicine.

[58] L. Braitman. Confidence intervals extract clinically useful information from data. 1988, Annals of Internal Medicine.

[59] Peter Urbach, et al. Scientific Reasoning: The Bayesian Approach. 1989.

[60] J. Ware. Investigating Therapies of Potentially Great Benefit: ECMO. 1989.

[61] S. Goodman. Meta-analysis and evidence. 1989, Controlled Clinical Trials.

[62] Colin B. Begg, et al. On inferences from Wei's biased coin design for clinical trials. 1990.

[63] G. Shafer. Savage revisited. 1990.

[64] K. J. Rothman. No Adjustments Are Needed for Multiple Comparisons. 1990, Epidemiology.

[65] I. Tannock, et al. How American oncologists treat breast cancer: an assessment of the influence of clinical trials. 1991, Journal of Clinical Oncology.

[66] D. Altman, et al. Improving Doctors' Understanding of Statistics. 1991.

[67] J. M. Robins, et al. Empirical-Bayes Adjustments for Multiple Comparisons Are Sometimes Useful. 1991, Epidemiology.

[68] D. Altman. Confidence intervals in research evaluation. 1992, ACP Journal Club.

[69] P. Freeman, et al. The role of p-values in analysing trial results. 1993, Statistics in Medicine.

[70] J. Concato, et al. The Risk of Determining Risk with Multivariable Models. 1993, Annals of Internal Medicine.

[71] D. A. Savitz, et al. Is statistical significance testing useful in interpreting data? 1993, Reproductive Toxicology.

[72] D. A. Berry, et al. A case for Bayesianism in clinical trials. 1993, Statistics in Medicine.

[73] Thomas A. Louis, et al. Graphical Elicitation of a Prior Distribution for a Clinical Trial. 1993.

[74] S. Goodman, et al. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. 1993, American Journal of Epidemiology.

[75] E. Lehmann. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? 1993.

[76] M. D. Hughes. Reporting Bayesian analyses of clinical trials. 1993, Statistics in Medicine.

[77] C. Howson, et al. Scientific Reasoning: The Bayesian Approach. 1989.

[78] J. Ludbrook, et al. Issues in biomedical statistics: statistical inference. 1994, The Australian and New Zealand Journal of Surgery.

[79] R. Serlin, et al. Misuse of statistical tests in three decades of psychotherapy research. 1994, Journal of Consulting and Clinical Psychology.

[80] David J. Spiegelhalter, et al. Bayesian Approaches to Randomized Trials. 1994, Bayesian Biostatistics.

[81] D. G. Altman, et al. Statistical aspects of prognostic factor studies in oncology. 1994, British Journal of Cancer.

[82] D. G. Altman, et al. Transfer of technology from statistical journals to the biomedical literature: past trends and future predictions. 1994, JAMA.

[83] L. Joseph, et al. Placing trials in context using Bayesian analysis: GUSTO revisited by Reverend Bayes. 1995, JAMA.

[84] D. A. Berry. Decision analysis and Bayesian methods in clinical trials. 1995, Cancer Treatment and Research.

[85] A. Olshan, et al. Multiple comparisons and related issues in the interpretation of epidemiologic data. 1995, American Journal of Epidemiology.

[86] Walter R. Gilks, et al. BUGS: Bayesian inference Using Gibbs Sampling, Version 0.50. 1995.

[87] Bayesian Analysis and the GUSTO Trial. 1995.

[88] J. Matthews. Quantification and the Quest for Medical Certainty. 1995.

[89] J. Kadane, et al. Bayesian statistical methods in public health and medicine. 1995, Annual Review of Public Health.

[90] J. B. Kadane. Prime time for Bayes. 1995, Controlled Clinical Trials.

[91] Bayesian analysis and the GUSTO trial. Global Utilization of Streptokinase and Tissue Plasminogen Activator in Occluded Arteries. 1995, JAMA.

[92] B. Efron. Empirical Bayes Methods for Combining Likelihoods. 1996.

[93] D. H. Spodick. "Evidence-based medicine": terminologic lapse or terminologic arrogance? 1996, The American Journal of Cardiology.

[94] Laurence S. Freedman. Bayesian statistical methods. 1996, BMJ.

[95] I. Tannock, et al. False-positive results in clinical trials: multiple significance tests and the problem of unreported comparisons. 1996, Journal of the National Cancer Institute.

[96] L. D. Fisher. Comments on Bayesian and frequentist analysis and interpretation of clinical trials. 1996, Controlled Clinical Trials.

[97] S. G. Thompson. A likelihood approach to meta-analysis with random effects. 1996, Statistics in Medicine.

[98] R. J. Lilford, et al. For Debate: The statistical basis of public policy: a paradigm shift is overdue. 1996, BMJ.

[99] L. Hedges, et al. Are the clinical effects of homeopathy placebo effects? A meta-analysis of placebo-controlled trials. 1997, Lancet.

[100] M. Barnett, et al. Tyranny of the p-Value: The Conflict between Statistical Significance and Common Sense. 1997, Journal of Dental Research.

[101] P. M. Fayers, et al. Tutorial in biostatistics: Bayesian data monitoring in clinical trials. 1997, Statistics in Medicine.

[102] Wayne B. Jonas, et al. Are the clinical effects of homoeopathy placebo effects? A meta-analysis of placebo-controlled trials. 1997, The Lancet.

[103] A. Feinstein, et al. Problems in the "evidence" of "evidence-based medicine". 1997, The American Journal of Medicine.

[104] Bradley P. Carlin, et al. Bayes and Empirical Bayes Methods for Data Analysis. 1996, Stat. Comput.

[105] H. Marks. The Progress of Experiment: Science and Therapeutic Reform in the United States, 1900-1990. 1997.

[106] K. Chia. "Significant-itis": an obsession with the P-value. 1997, Scandinavian Journal of Work, Environment & Health.

[107] I. Chalmers, et al. Discussion sections in reports of controlled trials published in general medical journals: islands in search of continents? 1998, JAMA.

[108] R. Califf, et al. Influence of a randomized clinical trial on practice by participating investigators: lessons from the Coronary Angioplasty Versus Excisional Atherectomy Trial (CAVEAT). CAVEAT I and II Investigators. 1998, Journal of the American College of Cardiology.

[109] S. Yusuf, et al. Overcoming the limitations of current meta-analysis of randomised controlled trials. 1998, The Lancet.

[110] M. Tonelli. The philosophical limits of evidence-based medicine. 1998, Academic Medicine.

[111] K. J. Rothman. That confounded P-value. 1998, Epidemiology.

[112] T. Perneger. What's wrong with Bonferroni adjustments. 1998, BMJ.

[113] A. R. Feinstein. P-values and confidence intervals: two sides of the same unsatisfactory coin. 1998, Journal of Clinical Epidemiology.

[114] S. Goodman. Multiple comparisons, explained. 1998, American Journal of Epidemiology.

[115] D. Berry, et al. Benefits and risks of screening mammography for women in their forties: a statistical appraisal. 1998, Journal of the National Cancer Institute.

[116] Christopher Hamlin. The Progress of Experiment: Science and Therapeutic Reform in the United States, 1900-1990. 1998.

[117] M. Demitrack, et al. Low-dose hydrocortisone for treatment of chronic fatigue syndrome: a randomized controlled trial. 1998, JAMA.

[118] J. Vandenbroucke. 175th anniversary lecture: Medical journals and the shaping of medical knowledge. 1998, The Lancet.

[119] L. Moyé. End-point interpretation in clinical trials: the case for discipline. 1999, Controlled Clinical Trials.

[120] E. Rimm, et al. Relation of Consumption of Vitamin E, Vitamin C, and Carotenoids to Risk for Stroke among Men in the United States. 1999, Annals of Internal Medicine.

[121] Steven Goodman. Toward Evidence-Based Medical Statistics. 2: The Bayes Factor. 1999, Annals of Internal Medicine.

[122] Lemuel A. Moyé. Carvedilol and the Food and Drug Administration approval process: an introduction. 1999, Controlled Clinical Trials.

[123] L. Fisher. Carvedilol and the Food and Drug Administration (FDA) approval process: the FDA paradigm and reflections on hypothesis testing. 1999, Controlled Clinical Trials.

[124] T. Louis, et al. Bayes and Empirical Bayes Methods for Data Analysis. 1996, Stat. Comput.