Efficient missing data imputation for supervised learning

In supervised learning, missing values often appear in the training set. Missing values can introduce bias, degrading the quality of the learning process and the performance of classification algorithms. This implies that a reliable method for dealing with missing values is necessary. In this paper, we analyze the difference between iterative imputation and single imputation of missing values in real-world applications. We propose an EM-style iterative imputation method in which each missing attribute value is repeatedly re-estimated by a predictor built from the known values and from the values predicted for the missing attribute values in previous iterations. We also demonstrate that the order in which multiple missing attribute values are imputed matters, and therefore introduce a method for imputation ordering. We show experimentally that our approach significantly outperforms some standard machine learning methods for handling missing values in classification tasks.
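
The iterative scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method evaluated in the paper: the function name `iterative_impute`, the least-squares predictor, the column-mean initialization, the "fewest missing values first" ordering, and the `n_iter` and `tol` parameters are all hypothetical choices made for the example, and the sketch handles numeric attributes only.

```python
import numpy as np

def iterative_impute(X, n_iter=10, tol=1e-6):
    """EM-style iterative imputation sketch; NaN marks a missing entry."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()

    # Initial guess (assumption): column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    filled[missing] = col_means[np.where(missing)[1]]

    # Hypothetical ordering rule: columns with the fewest missing entries first.
    counts = missing.sum(axis=0)
    order = [j for j in np.argsort(counts, kind="stable") if counts[j] > 0]

    for _ in range(n_iter):
        previous = filled.copy()
        for j in order:
            rows_missing = missing[:, j]
            rows_known = ~rows_missing
            # Predictors: an intercept plus all other columns, which hold
            # known values and the current estimates of missing values.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            # Fit on rows where column j is observed (stand-in predictor:
            # ordinary least squares).
            coef, *_ = np.linalg.lstsq(
                A[rows_known], filled[rows_known, j], rcond=None
            )
            # Re-estimate only the missing entries of column j.
            filled[rows_missing, j] = A[rows_missing] @ coef
        # Stop once the imputed values no longer change noticeably.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled
```

For instance, calling `iterative_impute` on a two-column array whose second column is `2 * x + 1` with one entry set to `np.nan` recovers that entry approximately, because the least-squares fit on the observed rows is exact. Swapping the predictor (for example, for a nearest-neighbor estimator) only changes the fit-and-predict lines inside the inner loop.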
