Missing data imputation using decision trees and fuzzy clustering with iterative learning

Various imputation approaches have been proposed to address missing values in data mining and machine learning applications. To improve imputation accuracy, this paper proposes a new method, DIFC, which integrates the merits of decision trees and fuzzy clustering into an iterative learning approach. DIFC is compared against five effective imputation methods in extensive experiments on six widely used datasets containing numerical and categorical missing data, with various amounts and types of missing values. The experimental results show that DIFC outperforms the other methods in imputation accuracy. Further experiments on the effect of missing value types demonstrate that DIFC is robust across different types of missing values. This paper contributes to missing data imputation research by providing an accurate and robust method.
