Using Entropy to Impute Missing Data in a Classification Task

In real applications, some of the data are usually missing, yet most data analysis and data mining techniques can handle only complete data. This paper proposes a new taxonomy of imputation methods and, within this taxonomy, introduces a new technique based on entropy measures. The behaviour of this technique is studied through an empirical comparative analysis.
