Impact of imputation of missing values on classification error for discrete data

Numerous industrial and research databases include missing values. It is not uncommon to encounter databases with up to half of their entries missing, which makes them very difficult to mine with data analysis methods that require complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of imputation method affects the performance of classifiers that are subsequently used with the imputed data. The experiments focus on discrete data. We study the effect of missing-data imputation using five single imputation methods (a mean method, a hot-deck method, a Naïve-Bayes method, and the latter two methods combined with a recently proposed imputation framework) and one multiple imputation method (a polytomous-regression-based method) on the classification accuracy of six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machines with polynomial and RBF kernels, and Naïve-Bayes) on 15 datasets. The experimental study shows that imputation with the tested methods on average improves classification accuracy compared with classification without imputation. Although the results show that there is no universally best imputation method, Naïve-Bayes imputation gives the best results for the RIPPER classifier on datasets with a high proportion (i.e., 40% and 50%) of missing data, polytomous regression imputation is best for the support vector machine with the polynomial kernel, and the imputation framework is superior for the support vector machine with the RBF kernel and for K-nearest-neighbor. An analysis of imputation quality across varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for mean imputation, reduce classification error for data with more than 10% of values missing.
Finally, some classifiers, such as C4.5 and Naïve-Bayes, were found to be resistant to missing data, i.e., they produce accurate classifications even in the presence of missing values, whereas other classifiers, such as K-nearest-neighbor, the SVMs, and RIPPER, benefit from the imputation.
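The paper's exact implementations are not reproduced here, but the two simplest single-imputation ideas it compares can be sketched for a discrete attribute. The sketch below (function names and the toy column are illustrative, not from the paper) shows mode imputation, the discrete analogue of the mean method, which replaces every missing entry with the most frequent observed value, and a basic random hot-deck, which fills each missing entry with a value drawn from a randomly chosen complete "donor" record:

```python
import random
from collections import Counter

def mode_impute(column):
    """Replace each None with the most frequent observed value (the mode).

    This is the discrete-data analogue of mean imputation: one constant
    fill value for the whole attribute.
    """
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]

def hot_deck_impute(column, rng=None):
    """Replace each None with the value of a randomly chosen donor record.

    Unlike mode imputation, this preserves the observed value
    distribution instead of collapsing every missing entry to one value.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    observed = [v for v in column if v is not None]
    return [rng.choice(observed) if v is None else v for v in column]

# Toy discrete attribute with two missing entries.
col = ["a", None, "b", "a", None, "a"]
print(mode_impute(col))      # → ['a', 'a', 'b', 'a', 'a', 'a']
print(hot_deck_impute(col))  # every None replaced by some observed value
```

The paper's Naïve-Bayes and polytomous-regression methods refine this idea by conditioning the fill value on the record's other attributes rather than on the column alone.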
