Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health

Estimates computed from data with missing values can be unreliable, and nonresponse bias can distort the inferences that researchers draw about the underlying population. Imputation is therefore often preferred to listwise deletion for handling multivariate missing data. In this study, we compared three popular imputation methods for multivariate missingness under different missing patterns: sequential multiple imputation, fractional hot-deck imputation, and generalized efficient regression-based imputation with latent processes. We ran descriptive and regression analyses on the imputed data and examined how the resulting estimates differed from those obtained from the full sample. We present limited Monte Carlo simulation results based on the National Health and Nutrition Examination Survey and the Behavioral Risk Factor Surveillance System to show the effect of each imputation method on reducing bias and increasing efficiency in the parameter estimate of interest for each incomplete variable. Although the three methods did not always outperform listwise deletion under our simulated missing patterns, they improved many descriptive and regression estimates when used to impute all incomplete variables at once.
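The evaluation design described above (delete values from a complete sample, impute, then compare the estimate against the full-sample value) can be illustrated with a small Monte Carlo sketch. This is not the study's code and uses none of its three methods: the data model, the missingness probabilities, the function names `fit_line` and `compare_methods`, and the use of simple deterministic regression imputation as a stand-in are all assumptions made for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def compare_methods(n=1000, reps=100, seed=0):
    """Average bias of the mean of y, relative to the full sample,
    under listwise deletion and under regression imputation.

    Hypothetical setup: x is always observed; y is missing with a
    probability that depends on x only (missing at random).
    """
    rng = random.Random(seed)
    bias_deletion, bias_imputation = [], []
    for _ in range(reps):
        # Simulated "full sample" (assumed linear model, not survey data).
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [1.0 + 0.8 * xi + rng.gauss(0, 1) for xi in x]
        full_mean = statistics.fmean(y)

        # y is missing more often when x is positive (assumed rates).
        observed = [rng.random() > (0.6 if xi > 0 else 0.1) for xi in x]
        x_obs = [xi for xi, o in zip(x, observed) if o]
        y_obs = [yi for yi, o in zip(y, observed) if o]

        # Listwise deletion: use complete cases only.
        bias_deletion.append(statistics.fmean(y_obs) - full_mean)

        # Regression imputation: predict each missing y from x.
        intercept, slope = fit_line(x_obs, y_obs)
        y_filled = [yi if o else intercept + slope * xi
                    for xi, yi, o in zip(x, y, observed)]
        bias_imputation.append(statistics.fmean(y_filled) - full_mean)

    return statistics.fmean(bias_deletion), statistics.fmean(bias_imputation)


if __name__ == "__main__":
    deletion, imputation = compare_methods()
    print(f"listwise deletion bias:     {deletion:+.3f}")
    print(f"regression imputation bias: {imputation:+.3f}")
```

In this toy setup listwise deletion underestimates the mean because high-`x` cases are dropped more often, while imputing from `x` largely removes that bias. Real survey data with several incomplete variables, as studied here, need not behave this cleanly.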
