Variable selection under multiple imputation using the bootstrap in a prognostic study

Background: Missing data are a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology combining MI with bootstrapping techniques for studying prognostic variable selection.

Method: In our prospective cohort study, we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0% to 48.1%. We used four methods to investigate the influence of sampling variation and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of prognostic models developed by the four methods were assessed at different inclusion levels.

Results: We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, inclusion levels ranging from 0% (full model) to 90% yielded bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86.

Conclusion: We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values.

Published: 13 July 2007
BMC Medical Research Methodology 2007, 7:33 doi:10.1186/1471-2288-7-33
Received: 6 October 2006
Accepted: 13 July 2007
This article is available from: http://www.biomedcentral.com/1471-2288/7/33
© 2007 Heymans et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
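
The inclusion frequency described in the Method paragraph can be illustrated with a minimal sketch. This is not the authors' implementation and rests on several assumptions made only for illustration: missing values are filled by random draws from the observed values of the same column (a crude stand-in for a proper MI procedure), the outcome is complete, variables are selected by a hypothetical correlation threshold rather than by a regression-based selection procedure, and bootstrap samples are drawn within each imputed data set (one possible ordering; the abstract does not state which orderings the two combined methods use). All function names, parameters, and the toy data are hypothetical.

```python
import numpy as np

def impute_random_draw(X, rng):
    """Fill missing entries with random draws from the observed values
    of the same column (assumed stand-in for one MI draw)."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X_imp[:, j])
        observed = X_imp[~missing, j]
        X_imp[missing, j] = rng.choice(observed, size=int(missing.sum()))
    return X_imp

def select_variables(X, y, threshold=0.2):
    """Hypothetical selector: keep variables whose absolute correlation
    with the outcome exceeds `threshold`."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.abs(r) > threshold

def inclusion_frequency(X, y, n_imputations=5, n_bootstraps=50, seed=0):
    """Proportion of (imputation, bootstrap) runs in which each variable
    is selected."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = []
    for _ in range(n_imputations):
        X_imp = impute_random_draw(X, rng)       # imputation variation
        for _ in range(n_bootstraps):
            idx = rng.integers(0, n, size=n)     # sampling variation
            selected.append(select_variables(X_imp[idx], y[idx]))
    return np.mean(selected, axis=0)

# Toy data: variable 0 drives the outcome, variables 1 and 2 are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)
X[rng.random(200) < 0.2, 0] = np.nan             # about 20% missing in variable 0

freq = inclusion_frequency(X, y)
keep = freq >= 0.9                               # retain variables at the 90% inclusion level
print(freq, keep)
```

Raising or lowering the inclusion level in the last step gives smaller or larger models, which corresponds to the range from 0% (full model) to 90% examined in the Results paragraph.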
