A Comparison of the Effects of Data Imputation Methods on Model Performance

Missing values cause critical problems when training a prediction model. Various missing-data imputation methods have been introduced to address this problem. However, the imputation accuracy achieved by these methods is not sufficient on its own to validate the performance of prediction models. Thus, in this study, we compare (1) the imputation accuracy of various imputation methods and (2) the effects of those methods on prediction accuracy, and we investigate the relationship between imputation accuracy and prediction accuracy. For the comparison, we use water quality data composed of the latest real-world multi-sensor observations from Daecheong Lake. We conduct several experiments to compare seven imputation methods, including a state-of-the-art method, and their effects on three distinct prediction models. Through quantitative comparison and analysis, we show that it is necessary to consider both imputation accuracy and model prediction accuracy when choosing an imputation method.
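The two-part comparison described above can be illustrated with a minimal sketch. It is not the study's pipeline: the synthetic sine-wave series, the two imputers (mean and linear interpolation), the lag-1 linear predictor, and all function names (`mean_impute`, `linear_impute`, `fit_lag1`, `evaluate`) are assumptions chosen only to show how imputation error and downstream prediction error can be measured separately.

```python
import math
import random


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def mean_impute(series):
    """Fill every missing entry (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in series]


def linear_impute(series):
    """Fill missing entries by linear interpolation between observed neighbours.

    Gaps at either end are filled with the nearest observed value.
    Assumes at least one observed value exists.
    """
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def fit_lag1(series):
    """Least-squares fit of y[t] = a * y[t-1] + b (a stand-in prediction model)."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def evaluate(n=400, missing_rate=0.2, seed=0):
    """Return {method: (imputation RMSE, prediction RMSE)} on synthetic data."""
    rng = random.Random(seed)
    # Synthetic "sensor" series: a noisy sine wave (an assumption, not lake data).
    truth = [math.sin(2 * math.pi * t / 50) + rng.gauss(0, 0.05) for t in range(n)]
    split = n * 3 // 4
    train_true, test = truth[:split], truth[split:]
    # Hide a random subset of the training values.
    masked = [None if rng.random() < missing_rate else v for v in train_true]
    missing = [i for i, v in enumerate(masked) if v is None]

    results = {}
    for name, impute in (("mean", mean_impute), ("linear", linear_impute)):
        filled = impute(masked)
        # (1) Imputation accuracy: error on the hidden entries only.
        imp_err = rmse([filled[i] for i in missing], [train_true[i] for i in missing])
        # (2) Prediction accuracy: train on imputed data, test on complete data.
        a, b = fit_lag1(filled)
        preds = [a * prev + b for prev in test[:-1]]
        pred_err = rmse(preds, test[1:])
        results[name] = (imp_err, pred_err)
    return results


if __name__ == "__main__":
    for method, (imp_err, pred_err) in evaluate().items():
        print(f"{method}: imputation RMSE={imp_err:.3f}, prediction RMSE={pred_err:.3f}")
```

Reporting the two errors side by side for each method makes it possible to check whether a ranking by imputation error matches the ranking by prediction error, which is the relationship the study investigates.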
