p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.

It is not generally appreciated that the p value, as conceived by R. A. Fisher, is not compatible with the Neyman-Pearson hypothesis test in which it has become embedded. The p value was meant to be a flexible inferential measure, whereas the hypothesis test was a rule for behavior, not inference. The combination of the two methods has led to a reinterpretation of the p value simultaneously as an "observed error rate" and as a measure of evidence. Both of these interpretations are problematic, and their combination has obscured the important differences between Neyman and Fisher on the nature of the scientific method and inhibited our understanding of the philosophic implications of the basic methods in use today. An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis. Likelihood makes clearer the distinction between error rates and inferential evidence and is a quantitative tool for expressing evidential strength that is more appropriate for the purposes of epidemiology than the p value.
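
The claim that the p value overstates the evidence against the null hypothesis can be illustrated with a short calculation. The sketch below is not the paper's own analysis. It assumes a normally distributed test statistic and compares the null hypothesis with the single alternative best supported by the data, which gives the smallest likelihood ratio any alternative could achieve, exp(-z^2/2). The function name `min_likelihood_ratio` and the chosen p values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def min_likelihood_ratio(p_two_sided: float) -> float:
    """Smallest likelihood ratio (null vs. any point alternative) for a
    normal test statistic whose two-sided p value is p_two_sided.

    Assumption: the statistic is z ~ Normal(theta, 1) with theta = 0 under
    the null. The alternative best supported by the data is theta = z, so
    the ratio is exp(-z**2 / 2).
    """
    # Convert the two-sided p value back to the observed |z|.
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.exp(-z * z / 2)


# Compare each p value with the most favorable likelihood ratio against the null.
for p in (0.10, 0.05, 0.01, 0.001):
    lr = min_likelihood_ratio(p)
    print(f"p = {p:<5}  minimum likelihood ratio = {lr:.4f}")
```

Under these assumptions, p = 0.05 corresponds to a minimum likelihood ratio of roughly 0.15: even the best-supported alternative is favored over the null by less than 7 to 1, which is weaker evidence than the figure 0.05 suggests when read as a measure of evidence.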
