Assessment of Internal Validity of Prognostic Models through Bootstrapping and Multiple Imputation of Missing Data

Background: Prognostic models have clinical appeal as aids to therapeutic decision making. Two main practical challenges in developing such models are assessing their validity and handling missing data. This study highlights the importance of imputing missing data and of applying the bootstrap technique in the development, simplification, and internal validation of a prognostic model. Methods: Overall, 310 breast cancer patients were recruited. Missing data were imputed 10 times. To assess the sensitivity of the model to small changes in the data (internal validity), 100 bootstrap samples were then drawn from each of the 10 imputed data sets, giving 1000 samples. A Cox regression model was fitted to each of the 1000 samples. Only variables retained in more than 50% of samples were used to develop the final model. Results: Four variables remained significant in more than 50% (i.e. 500) of the bootstrap samples: tumour size (91%), tumour grade (64%), history of benign breast disease (77%), and age at diagnosis (59%). Tumour size was the strongest predictor, with an inclusion frequency exceeding 90%. Number of deliveries was correlated with age at diagnosis (r=0.35, P<0.001). Taken together, these two variables remained significant in more than 90% of samples. Conclusion: We addressed two important methodological issues using a cohort of breast cancer patients. The algorithm combines multiple imputation of missing data with bootstrapping and has the potential to be applied in all kinds of regression modelling exercises to address the internal validity of models.
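The resampling scheme described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`bootstrap_inclusion_frequencies`, `retained`, `toy_selector`) are hypothetical, each data set is assumed to be a list of per-patient dictionaries, and `toy_selector` is a deliberately trivial stand-in for the step in which a Cox regression model is fitted to a sample and its significant variables are recorded.

```python
import random
from collections import Counter


def bootstrap_inclusion_frequencies(imputed_datasets, select_variables,
                                    n_boot=100, seed=0):
    """Draw n_boot bootstrap samples from each imputed data set and
    return, per variable, the fraction of all samples in which the
    selector retained it."""
    rng = random.Random(seed)
    counts = Counter()
    total = 0
    for data in imputed_datasets:
        n = len(data)
        for _ in range(n_boot):
            # Bootstrap sample: n rows drawn with replacement.
            sample = [data[rng.randrange(n)] for _ in range(n)]
            # set() ensures a variable is counted once per sample.
            counts.update(set(select_variables(sample)))
            total += 1
    return {var: c / total for var, c in counts.items()}


def retained(frequencies, threshold=0.5):
    """Variables whose inclusion frequency strictly exceeds the threshold."""
    return sorted(v for v, f in frequencies.items() if f > threshold)


def toy_selector(sample):
    """Hypothetical placeholder for model fitting and variable selection:
    keeps a variable when its sample mean is positive. A real analysis
    would fit a Cox model here and return the significant covariates."""
    variables = sample[0].keys()
    return [v for v in variables
            if sum(row[v] for row in sample) / len(sample) > 0]
```

With 10 imputed data sets and `n_boot=100`, the selector is called 10 × 100 = 1000 times, so the 50% rule corresponds to retention in more than 500 samples. Passing the reported inclusion frequencies (0.91, 0.64, 0.77, 0.59) to `retained` would keep all four variables, since each exceeds 0.5.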
