Does the Missing Data Imputation Method Affect the Composition and Performance of Prognostic Models?

Background: We previously showed that imputing missing data with the Multivariate Imputation by Chained Equations (MICE) method is superior to excluding incomplete cases; however, the MICE methodology is complicated, and simpler imputation methods are available. The aim of this study was to compare these methods in terms of model composition and performance.

Methods: Three hundred and ten breast cancer patients were recruited. Four approaches were applied to impute missing data. First, we adopted an ad hoc method in which the missing values of each variable were replaced by the median of its observed values. Then three likelihood-based approaches were used. In regression imputation, the variable with missing data was regressed on the rest of the variables, and the fitted regression equation was used to fill in the missing values. In the Expectation-Maximization (E-M) algorithm, the missing data and the regression parameters were estimated iteratively until the regression parameters converged. Finally, the MICE method was applied. The resulting models were compared in terms of the variables that contributed significantly to the multifactorial analysis, and in terms of sensitivity and specificity.

Results: All candidate variables contributed significantly to the MICE model. However, grade of disease lost its effect in the other three models. The MICE model showed the best performance, followed by the E-M model.

Conclusion: The final models differed across imputation methods in both composition and performance. Therefore, modern imputation methods are recommended to recover the information.
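
To make the simpler approaches described in the Methods concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single, fully observed predictor and one variable with missing values (marked `None`). All function names and the toy data are hypothetical. The iterative routine is only an E-M-style simplification: it alternates between filling in the missing values and refitting the regression until the parameters stabilise, whereas a full E-M algorithm also updates variance terms. MICE is not implemented here.

```python
from statistics import mean, median


def median_impute(values):
    """Ad hoc method: replace each missing value with the observed median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def regression_impute(xs, ys):
    """Fit the regression on complete cases, then predict the missing values."""
    complete = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in complete],
                                [y for _, y in complete])
    return [intercept + slope * x if y is None else y
            for x, y in zip(xs, ys)]


def iterative_impute(xs, ys, tol=1e-10, max_iter=500):
    """E-M-style loop (simplified): impute, refit, repeat until convergence."""
    missing = [i for i, y in enumerate(ys) if y is None]
    filled = median_impute(ys)          # starting values for the missing cells
    slope, intercept = fit_line(xs, filled)
    for _ in range(max_iter):
        for i in missing:               # re-impute from the current fit
            filled[i] = intercept + slope * xs[i]
        new_slope, new_intercept = fit_line(xs, filled)
        converged = (abs(new_slope - slope) < tol
                     and abs(new_intercept - intercept) < tol)
        slope, intercept = new_slope, new_intercept
        if converged:
            break
    return filled


# Hypothetical toy data: x is fully observed, y has two missing values.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, None, 8.2, None, 12.0]

by_median = median_impute(ys)
by_regression = regression_impute(xs, ys)
by_iteration = iterative_impute(xs, ys)
```

In this one-predictor setting the iterative loop settles on essentially the same values as single regression imputation, because the imputed points end up lying on the fitted line. The methods diverge in practice when several variables have missing values and each must be imputed from the others, which is the situation chained-equation methods are designed for.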
