Bootstrap Methods for Developing Predictive Models

Researchers frequently use automated model selection methods such as backward elimination to identify variables that are independent predictors of an outcome under consideration. We propose using bootstrap resampling in conjunction with automated variable selection methods to develop parsimonious prediction models. Using data on patients admitted to hospital with a heart attack, we demonstrate that selecting those variables that were identified as independent predictors of mortality in at least 60% of the bootstrap samples resulted in a parsimonious model with excellent predictive ability.
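
The resampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic data, and the `toy_selector` rule (a crude difference-in-means screen standing in for backward elimination on a logistic regression model) are all hypothetical. Only the overall scheme, running an automated selector on each bootstrap sample and keeping variables chosen in at least 60% of samples, comes from the abstract.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_inclusion(rows, select_fn, n_boot=200, threshold=0.6, seed=0):
    """Run select_fn on n_boot bootstrap samples of rows.

    Returns (kept, freq): the variables selected in at least `threshold`
    of the samples, and each variable's selection frequency.
    """
    rng = random.Random(seed)
    n = len(rows)
    counts = Counter()
    for _ in range(n_boot):
        # Draw n rows with replacement from the original sample.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        # Count each variable at most once per bootstrap sample.
        counts.update(set(select_fn(sample)))
    freq = {name: c / n_boot for name, c in counts.items()}
    kept = sorted(name for name, f in freq.items() if f >= threshold)
    return kept, freq


def toy_selector(sample, outcome="died", cutoff=0.5):
    """Hypothetical stand-in for backward elimination (assumption).

    Keeps a predictor when its mean differs between outcome groups by
    more than `cutoff`. A real analysis would fit a regression model and
    drop non-significant terms instead.
    """
    dead = [r for r in sample if r[outcome] == 1]
    alive = [r for r in sample if r[outcome] == 0]
    if not dead or not alive:
        return []  # degenerate resample: nothing can be selected
    chosen = []
    for name in sample[0]:
        if name == outcome:
            continue
        gap = abs(mean(r[name] for r in dead) - mean(r[name] for r in alive))
        if gap > cutoff:
            chosen.append(name)
    return chosen


# Synthetic data (assumption): "signal" is shifted upward for patients
# who died, while "noise" is unrelated to the outcome.
data_rng = random.Random(1)
rows = []
for i in range(200):
    died = i % 2
    rows.append({
        "signal": data_rng.gauss(0, 1) + 2 * died,
        "noise": data_rng.gauss(0, 1),
        "died": died,
    })

kept, freq = bootstrap_inclusion(rows, toy_selector, n_boot=200, threshold=0.6)
print("kept:", kept)
print("selection frequencies:", freq)
```

Passing the selector in as a function keeps the resampling loop independent of the selection method, so the toy rule could be replaced by a genuine backward-elimination routine without changing `bootstrap_inclusion`. The 0.6 default mirrors the 60% cutoff reported in the abstract.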
