The estimation and use of predictions for the assessment of model performance using large samples with multiply imputed data

Multiple imputation can be used when constructing prediction models in medical and epidemiological studies with missing covariate values. Such models can then be used to make predictions for assessing model performance, but the multiple imputation structure complicates this task. We summarize the various predictions that can be constructed from covariates, including multiply imputed covariates, combined with either the set of imputation-specific prediction model coefficients or the pooled prediction model coefficients. We further describe approaches for using these predictions to assess model performance. We distinguish between ideal model performance and pragmatic model performance: the former refers to the model's performance in an ideal clinical setting where all individuals have fully observed predictors, and the latter to its performance in a real-world clinical setting where some individuals have missing predictors. The approaches are compared through an extensive simulation study based on the UK700 trial. We determine that measures of ideal model performance can be estimated within imputed datasets and subsequently pooled to give an overall measure of model performance. Alternative methods are required to evaluate pragmatic model performance, and we propose constructing predictions either from a second set of covariate imputations that make no use of observed outcomes, or from a set of partial prediction models constructed for each potential pattern of observed covariates. Pragmatic model performance is generally lower than ideal model performance. We focus on model performance within the derivation data, but describe how to extend all the methods to a validation dataset.
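
As an illustration of the "estimate within imputed datasets, then pool" step for ideal model performance, the following minimal sketch pools a performance measure across imputed datasets using Rubin's rules. The abstract does not state which pooling rule is used, so Rubin's rules are an assumption here, and the function name `pool_rubin` and the example numbers are hypothetical. Some measures (for example a C-index) may need transforming before pooling; that step is not shown.

```python
def pool_rubin(estimates, variances):
    """Pool per-imputation estimates with Rubin's rules (assumed pooling rule).

    estimates: one performance estimate per imputed dataset
    variances: the within-imputation variance of each estimate
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    # Pooled point estimate: simple average over the m imputed datasets.
    q_bar = sum(estimates) / m
    # Within-imputation variance: average of the per-dataset variances.
    within = sum(variances) / m
    # Between-imputation variance: sample variance of the estimates.
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    # Total variance combines both sources, inflated for finite m.
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical example: a performance measure from three imputed datasets.
pooled, total_var = pool_rubin([0.70, 0.72, 0.74], [0.0004, 0.0004, 0.0004])
```

The pooled estimate is the average across imputations, while the total variance reflects both sampling uncertainty and the extra uncertainty due to the missing data.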
