Handling missing predictor values when validating and applying a prediction model to new patients

Abstract

Missing data present challenges for both the development and the real-world application of clinical prediction models. While these challenges have received considerable attention in the development setting, research on handling missing data in applied settings is sparse. The distinguishing feature of these settings is that missing data methods have to be applied to a single new individual, which precludes direct use of the mainstay methods employed during model development. Correspondingly, we propose that model validation should be performed using missing data methods that transfer to practice in single new patients. This article compares existing and new methods to account for missing data for a new individual in the context of prediction. These methods rely on (i) submodels that use observed data only, (ii) marginalization over the missing variables, or (iii) imputation based on fully conditional specification (also known as chained equations). They were compared in an internal validation setting to highlight the use of missing data methods that transfer to practice while validating a model. As a reference, they were also compared with multiple imputation by chained equations applied to a set of test patients, because this approach has been used in past validation studies. The methods were evaluated in a simulation study, with performance measured by the optimism-corrected C-statistic and the mean squared prediction error. Furthermore, they were applied to data from a large Dutch cohort of prophylactic implantable cardioverter defibrillator patients.
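To make approach (ii) concrete, the following is a minimal sketch of marginalizing a prediction over the missing predictors of one new patient. It is an illustration under stated assumptions, not the implementation evaluated in the article: the predictors are assumed to be jointly normal with mean `mu` and covariance `sigma` estimated from development data, the prediction model is assumed to be a logistic regression, and the names `conditional_normal`, `predict_marginalized`, `intercept`, and `beta` are hypothetical.

```python
import numpy as np


def conditional_normal(x, mu, sigma):
    """Conditional mean and covariance of the missing entries of x (NaN)
    given the observed entries, assuming x ~ N(mu, sigma)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    miss = np.isnan(x)
    obs = ~miss
    s_oo = sigma[np.ix_(obs, obs)]
    s_mo = sigma[np.ix_(miss, obs)]
    s_mm = sigma[np.ix_(miss, miss)]
    # w = S_mo S_oo^{-1}, computed with a linear solve instead of an inverse
    w = np.linalg.solve(s_oo, s_mo.T).T
    cond_mean = mu[miss] + w @ (x[obs] - mu[obs])
    cond_cov = s_mm - w @ s_mo.T
    return miss, cond_mean, cond_cov


def predict_marginalized(x, mu, sigma, intercept, beta, n_draws=2000, seed=0):
    """Predicted risk for one patient, averaged over Monte Carlo draws of the
    missing predictors from their conditional distribution.
    Assumption: a logistic model with hypothetical coefficients."""
    x = np.asarray(x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    if not np.isnan(x).any():
        # Complete record: ordinary model prediction
        return float(1.0 / (1.0 + np.exp(-(intercept + x @ beta))))
    miss, cond_mean, cond_cov = conditional_normal(x, mu, sigma)
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(cond_mean, cond_cov, size=n_draws)
    filled = np.tile(x, (n_draws, 1))
    filled[:, miss] = draws
    risks = 1.0 / (1.0 + np.exp(-(intercept + filled @ beta)))
    # Average the risks, not the linear predictors
    return float(risks.mean())


# Usage with made-up numbers: second predictor is missing for this patient
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
risk = predict_marginalized([1.0, np.nan], mu, sigma, intercept=-1.0, beta=[0.8, 0.4])
```

Everything needed at prediction time (`mu`, `sigma`, and the model coefficients) can be stored alongside the model, so the calculation works for a single patient without access to the development data. Averaging on the risk scale rather than the linear predictor scale is a design choice that matters because the logistic function is nonlinear.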
