Missing covariate data in medical research: to impute is better than to ignore.

OBJECTIVE We compared popular simple methods for handling missing data against multiple imputation, a more sophisticated method that preserves the observed data.

STUDY DESIGN AND SETTING We used data from 804 patients with suspected deep venous thrombosis (DVT). We studied three covariates as predictors of the presence of DVT: d-dimer level, difference in calf circumference, and history of leg trauma. We introduced missing values (missing at random) at rates ranging from 10% to 90%. The risk of DVT was modeled with logistic regression under each of three methods: complete case analysis, exclusion of d-dimer level from the model, and multiple imputation.

RESULTS Multiple imputation yielded less bias in the regression coefficients of the three variables, and more accurate coverage of the corresponding 90% confidence intervals, than either complete case analysis or dropping d-dimer level from the analysis. Multiple imputation gave unbiased estimates of the area under the receiver operating characteristic curve (0.88), compared with 0.77 for complete case analysis and 0.65 when the variable with missing values was dropped.

CONCLUSION Because this study shows that simple methods for dealing with missing data can lead to seriously misleading results, we advise considering multiple imputation. The purpose of multiple imputation is not to create data, but to prevent the exclusion of observed data.
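To make the multiple imputation step more concrete, the sketch below shows how a single regression coefficient fitted on several imputed datasets is commonly combined into one estimate and standard error using Rubin's rules. The abstract does not state how the authors pooled their results, so this is an illustration of the standard approach rather than a description of their procedure. The function name `pool_rubin` and all numeric values are hypothetical and are not taken from the study.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimated on each imputed dataset
    variances: the squared standard error of that coefficient per dataset
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # mean within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical log-odds coefficients and standard errors from five
# imputed datasets (illustrative numbers only, not from the study).
coefs = [1.10, 1.25, 1.18, 1.05, 1.22]
ses = [0.30, 0.32, 0.29, 0.31, 0.30]

est, se = pool_rubin(coefs, [s ** 2 for s in ses])
print(f"pooled coefficient = {est:.3f}, pooled SE = {se:.3f}")
```

The pooled standard error is larger than the average per-dataset standard error because the between-imputation term carries the extra uncertainty caused by the missing values. This is one reason multiple imputation can give more accurate confidence interval coverage than filling in a single value.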
