Comparison of five imputation methods in handling missing data in a continuous frequency table

Missing data are sometimes unavoidable and can affect the overall results of a study. When values are missing, a continuous frequency table built from the data is incomplete, so the missing values must be estimated to arrive at valid results. An established imputation method from the literature is appropriate for this purpose. This study compares five imputation methods: mean imputation, median imputation, k-nearest neighbors (KNN), sample imputation, and multiple imputation by chained equations (MICE). The methods are compared on four real datasets, with nine different percentages of missingness introduced completely at random. Performance is assessed with the root-mean-squared error (RMSE). The results show that MICE outperformed the other imputation methods, and that mean imputation and KNN performed better than sample and median imputation. The performance of the five imputation methods was independent of the dataset and of the percentage of missingness.
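
The evaluation procedure described above (hide values completely at random, impute them, and score the imputations with RMSE against the hidden true values) can be sketched as follows. This is a minimal illustration and not the study's code. It covers only the three single-variable methods (mean, median, and sample imputation) and omits KNN and MICE, which need multivariate data and a dedicated library. It assumes that "sample imputation" means drawing replacements at random from the observed values. The toy data, the three missingness fractions, and all function names are hypothetical.

```python
import math
import random
import statistics


def mask_mcar(values, fraction, rng):
    """Pick indices to hide, completely at random (MCAR)."""
    n_missing = max(1, round(fraction * len(values)))
    return set(rng.sample(range(len(values)), n_missing))


def impute(values, missing, method, rng):
    """Fill the hidden positions using only the observed values."""
    observed = [v for i, v in enumerate(values) if i not in missing]
    if method == "mean":
        center = statistics.mean(observed)
        return [center if i in missing else v for i, v in enumerate(values)]
    if method == "median":
        center = statistics.median(observed)
        return [center if i in missing else v for i, v in enumerate(values)]
    if method == "sample":
        # Assumption: sample imputation = random draw from observed values.
        return [rng.choice(observed) if i in missing else v
                for i, v in enumerate(values)]
    raise ValueError(f"unknown method: {method}")


def rmse(true_values, imputed, missing):
    """Root-mean-squared error over the hidden positions only."""
    squared = [(true_values[i] - imputed[i]) ** 2 for i in missing]
    return math.sqrt(sum(squared) / len(squared))


def compare(values, fractions, methods=("mean", "median", "sample"), seed=0):
    """Return {(fraction, method): RMSE} for every combination."""
    rng = random.Random(seed)
    results = {}
    for fraction in fractions:
        missing = mask_mcar(values, fraction, rng)
        for method in methods:
            filled = impute(values, missing, method, rng)
            results[(fraction, method)] = rmse(values, filled, missing)
    return results


# Hypothetical toy data standing in for one continuous variable.
data_rng = random.Random(42)
data = [data_rng.gauss(25.0, 4.0) for _ in range(200)]

# Illustrative fractions only; the study's nine percentages are not listed here.
results = compare(data, fractions=[0.1, 0.3, 0.5])
for (fraction, method), score in sorted(results.items()):
    print(f"{fraction:.0%} missing, {method:>6}: RMSE = {score:.3f}")
```

Scoring only the hidden positions keeps the untouched observed values from diluting the error. In a fuller replication, each missingness level would typically be repeated over several random masks and the RMSE values averaged.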
