Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates, as well as better model performance, for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
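
To make the three missingness mechanisms concrete, the following is a minimal Python sketch of how missing values might be imposed on one skewed covariate. It is not the procedure used in the study, whose details are not given here: the logistic missingness model, the unit slope, the bisection step used to hit a target proportion, and all names (`impose_missingness`, `x`, `z`, `target_rate`) are assumptions made for illustration. Under MCAR the probability of being missing is constant, under MAR it depends on a fully observed covariate `z`, and under MNAR it depends on the value of the incomplete covariate `x` itself.

```python
import numpy as np


def impose_missingness(x, z, mechanism, target_rate, rng):
    """Return a copy of x with some values set to NaN.

    x: covariate to be made incomplete (1-D array).
    z: fully observed covariate of the same length (used for MAR only).
    mechanism: "MCAR", "MAR" or "MNAR".
    target_rate: desired expected proportion of missing values.
    rng: a numpy random Generator.
    """
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = x.shape[0]

    # Standardised score that drives the missingness probability.
    if mechanism == "MCAR":
        score = np.zeros(n)                 # depends on nothing
    elif mechanism == "MAR":
        score = (z - z.mean()) / z.std()    # depends on observed data
    elif mechanism == "MNAR":
        score = (x - x.mean()) / x.std()    # depends on the value itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Illustrative logistic model: P(missing) = expit(intercept + score).
    # Bisection finds the intercept giving the target average probability.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        p = 1.0 / (1.0 + np.exp(-(mid + score)))
        if p.mean() < target_rate:
            lo = mid
        else:
            hi = mid
    p = 1.0 / (1.0 + np.exp(-(mid + score)))

    out = x.copy()
    out[rng.random(n) < p] = np.nan
    return out


# Hypothetical usage: a skewed covariate with roughly 25% missing values.
rng = np.random.default_rng(2024)
x = rng.lognormal(mean=0.0, sigma=0.5, size=1000)
z = rng.normal(size=1000)
x_mnar = impose_missingness(x, z, "MNAR", 0.25, rng)
print(f"proportion missing: {np.isnan(x_mnar).mean():.2f}")
```

In this sketch the complete cases under MNAR over-represent small values of `x`, because larger values are more likely to be deleted. The observed data alone cannot reveal that kind of distortion, which is consistent with the finding above that all MI approaches gave biased estimates under MNAR.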
