Variable selection and Bayesian model averaging in case‐control studies

Covariate and confounder selection in case‐control studies is often carried out using a statistical variable selection method, such as a two‐step method or a stepwise method in logistic regression. Inference is then carried out conditionally on the selected model, but this ignores the model uncertainty implicit in the variable selection process, and so may underestimate uncertainty about relative risks. We report on a simulation study designed to be similar to actual case‐control studies. This shows that p‐values computed after variable selection can greatly overstate the strength of conclusions. For example, for our simulated case‐control studies with 1000 subjects, of variables declared to be ‘significant’ with p‐values between 0.01 and 0.05, only 49 per cent actually were risk factors when stepwise variable selection was used. We propose Bayesian model averaging as a formal way of taking account of model uncertainty in case‐control studies. This yields an easily interpreted summary, the posterior probability that a variable is a risk factor, and our simulation study indicates this to be reasonably well calibrated in the situations simulated. The methods are applied and compared in the context of a case‐control study of cervical cancer. Copyright © 2001 John Wiley & Sons, Ltd.
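To make the idea of a posterior probability that a variable is a risk factor concrete, the following is a minimal sketch of Bayesian model averaging over logistic regression models. It is an illustration under stated assumptions, not necessarily the procedure used in the study: posterior model probabilities are approximated from BIC, all models are given equal prior probability, every subset of candidate variables is enumerated, and the data are simulated from a plain logistic model rather than by case-control sampling. The function names `max_loglik` and `inclusion_probabilities` are hypothetical.

```python
import itertools

import numpy as np


def max_loglik(design, y, n_iter=50, tol=1e-10):
    """Maximised logistic log-likelihood, fitted by Newton-Raphson.

    `design` must already contain an intercept column.
    """
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        grad = design.T @ (y - p)
        hess = design.T @ (design * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    eta = design @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))


def inclusion_probabilities(X, y):
    """Approximate posterior inclusion probability of each column of X.

    Assumptions (not taken from the source): equal prior probability for
    every model, and posterior model weights proportional to exp(-BIC / 2).
    """
    n, k = X.shape
    intercept = np.ones((n, 1))
    models, bics = [], []
    for size in range(k + 1):
        for subset in itertools.combinations(range(k), size):
            design = np.hstack([intercept, X[:, list(subset)]])
            bic = -2.0 * max_loglik(design, y) + design.shape[1] * np.log(n)
            models.append(subset)
            bics.append(bic)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # shift for numerical stability
    weights /= weights.sum()
    # Inclusion probability: total weight of the models that contain the variable.
    pip = np.array(
        [sum(w for w, m in zip(weights, models) if j in m) for j in range(k)]
    )
    return pip, weights


# Simulated example: 1000 subjects, 4 candidate variables, only the first
# one is a true risk factor (log odds ratio 1.0, an arbitrary choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
true_eta = -0.5 + 1.0 * X[:, 0]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_eta)))

pip, weights = inclusion_probabilities(X, y)
print(np.round(pip, 3))
```

Each entry of `pip` sums the weights of all models that include the corresponding variable, so it carries the model uncertainty that a p-value computed after stepwise selection ignores. Exhaustive enumeration is only practical for a small number of candidate variables; with many candidates a search strategy over the model space would be needed instead.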
