Using the bootstrap to improve estimation and confidence intervals for regression coefficients selected using backwards variable elimination

Applied researchers frequently use automated model selection methods, such as backwards variable elimination, to develop parsimonious regression models. Statisticians have criticized these methods for several reasons, among them that the estimated regression coefficients are biased and that the derived confidence intervals do not have the advertised coverage rates. We developed a method to improve the estimation of regression coefficients and their confidence intervals that employs backwards variable elimination in multiple bootstrap samples. In a given bootstrap sample, predictor variables that are not selected for inclusion in the final regression model have their regression coefficients set to zero. Regression coefficients are averaged across the bootstrap samples, and non-parametric percentile bootstrap confidence intervals are then constructed for each regression coefficient. We conducted a series of Monte Carlo simulations to examine the performance of this method for estimating regression coefficients and constructing confidence intervals for variables selected using backwards variable elimination. We demonstrated that this method results in confidence intervals with superior coverage compared with those derived from conventional backwards variable elimination. We illustrate the utility of our method by applying it to a large sample of subjects hospitalized with a heart attack.
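The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes an ordinary least-squares model, and it uses a fixed |t| cutoff of 1.96 as a stand-in for the significance threshold, which the abstract does not specify. The function names (`ols`, `backward_eliminate`, `bootstrap_backward`) and the defaults `n_boot=200` and `alpha=0.05` are hypothetical.

```python
import numpy as np

def ols(design, y):
    """Least-squares fit; returns coefficients and their t-statistics."""
    n, p = design.shape
    xtx_inv = np.linalg.pinv(design.T @ design)
    beta = xtx_inv @ design.T @ y
    resid = y - design @ beta
    sigma2 = resid @ resid / max(n - p, 1)
    se = np.sqrt(sigma2 * np.diag(xtx_inv))
    return beta, beta / se

def backward_eliminate(X, y, t_crit=1.96):
    """Backwards elimination (assumed rule: drop the predictor with the
    smallest |t| until every remaining |t| >= t_crit).
    Returns one slope per column of X; eliminated predictors get 0."""
    n, k = X.shape
    active = list(range(k))
    coefs = np.zeros(k)
    while active:
        design = np.column_stack([np.ones(n), X[:, active]])
        beta, t = ols(design, y)
        abs_t = np.abs(t[1:])          # skip the intercept
        worst = int(np.argmin(abs_t))
        if abs_t[worst] >= t_crit:
            coefs[active] = beta[1:]   # keep the surviving slopes
            break
        active.pop(worst)
    return coefs

def bootstrap_backward(X, y, n_boot=200, alpha=0.05, seed=0):
    """Run backwards elimination in each bootstrap sample, average the
    coefficients (zeros included), and form percentile intervals."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    draws = np.empty((n_boot, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        draws[b] = backward_eliminate(X[idx], y[idx])
    estimate = draws.mean(axis=0)
    lower, upper = np.percentile(
        draws, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0
    )
    return estimate, lower, upper
```

Calling `bootstrap_backward(X, y)` on an `n × k` predictor matrix returns three length-`k` arrays: the averaged coefficients and the lower and upper percentile limits. Because a predictor that is eliminated in a bootstrap sample contributes a zero, the averaged coefficient of a rarely selected variable is pulled towards zero.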
