Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality.

OBJECTIVES: Automated variable selection methods are frequently used to determine the independent predictors of an outcome. The objective of this study was to determine the reproducibility of logistic regression models developed using automated variable selection methods.

STUDY DESIGN AND SETTING: An initial set of 29 candidate variables was considered for predicting mortality after acute myocardial infarction (AMI). We drew 1,000 bootstrap samples from a dataset consisting of 4,911 patients admitted to hospital with an AMI. In each bootstrap sample, logistic regression models predicting 30-day mortality were developed using backward elimination, forward selection, and stepwise selection. We then assessed the agreement between the three selection methods and the agreement across the 1,000 bootstrap samples.

RESULTS: Across the 1,000 bootstrap samples, backward elimination identified 940 unique models for predicting mortality. Similar results were obtained for forward and stepwise selection. Three variables were identified as independent predictors of mortality in all bootstrap samples. Over half of the candidate prognostic variables were identified as independent predictors in fewer than half of the bootstrap samples.

CONCLUSION: Automated variable selection methods result in models that are unstable and not reproducible. The variables selected as independent predictors are sensitive to random fluctuations in the data.
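The bookkeeping behind these results (resample, select, then count distinct models and per-variable selection frequencies) can be illustrated with a minimal Python sketch. It is not the study's code. The selector `screen_select` is a deliberately simple stand-in (a univariate two-sample z screen), not backward elimination on a logistic model, and the synthetic data, the variable names `x0` to `x5`, and the outcome key `died` are all hypothetical. Any selection routine that returns a set of variable names could be passed in its place.

```python
import math
import random
from collections import Counter
from statistics import mean, variance


def make_synthetic(n=500, seed=1):
    """Hypothetical data: x0 is a strong predictor, x1 a weak one, x2..x5 are noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {f"x{j}": rng.gauss(0, 1) for j in range(6)}
        linear = -2.0 + 1.0 * row["x0"] + 0.25 * row["x1"]
        p_death = 1.0 / (1.0 + math.exp(-linear))
        row["died"] = 1 if rng.random() < p_death else 0
        rows.append(row)
    return rows


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def screen_select(rows, names, z_crit=1.96):
    """Stand-in selector (NOT backward elimination): keep a variable when the
    two-sample z statistic comparing deaths with survivors exceeds z_crit."""
    dead = [r for r in rows if r["died"] == 1]
    alive = [r for r in rows if r["died"] == 0]
    chosen = []
    for name in names:
        a = [r[name] for r in dead]
        b = [r[name] for r in alive]
        if len(a) < 2 or len(b) < 2:
            continue  # too few observations to estimate a variance
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        if se > 0 and abs(mean(a) - mean(b)) / se > z_crit:
            chosen.append(name)
    return frozenset(chosen)


def selection_stability(rows, names, select, n_boot=200, seed=0):
    """Return (number of distinct selected models, per-variable selection frequency)."""
    rng = random.Random(seed)
    models = Counter()     # how often each exact variable set was selected
    inclusion = Counter()  # how often each variable appeared in a selected set
    for _ in range(n_boot):
        model = select(bootstrap_sample(rows, rng), names)
        models[model] += 1
        inclusion.update(model)
    freq = {name: inclusion[name] / n_boot for name in names}
    return len(models), freq


if __name__ == "__main__":
    names = [f"x{j}" for j in range(6)]
    n_unique, freq = selection_stability(make_synthetic(), names, screen_select)
    print("distinct models:", n_unique)
    for name in names:
        print(name, round(freq[name], 3))
```

Storing each selected model as a `frozenset` makes two bootstrap replicates count as the same model only when they select exactly the same variables, which is the sense in which the abstract reports 940 unique models.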
