Paper information

A missing data imputation approach using clustering and maximum likelihood estimation

Missing data is a data mining problem that adversely affects data analysis and decision-making, and it arises frequently in healthcare data for a variety of reasons. It remains an important research topic because the success of any imputation method depends on many factors, such as the characteristics of the data and the type of missingness. In this study, an approach to the missing data problem based on clustering and maximum likelihood estimation (MLE) is proposed. To test the proposed method, the "Mesothelioma" data set, prepared by the Dicle University Medical School and uploaded to the UCI international open-source database, was used. In the first step, new data sets were created that conform to the missing data patterns missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the second step, these new data sets were divided into clusters by a k-means clustering process that excludes the three features with missing data, in order to improve the performance of the MLE computation. In the last step, the features with missing values were added back to the clusters, the missing data were completed with the MLE method within each cluster, and the clusters were merged to obtain the complete data set. The data sets obtained from these three steps (data reduction, clustering, and data completion) were compared with the original data set according to the root mean square error (RMSE) criterion, and an average success of 96.5% was achieved.
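The three steps summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes synthetic data in place of the Mesothelioma set, covers only the MCAR pattern, uses a hand-written k-means, and takes an EM fit of a multivariate normal as one common way to carry out the MLE step. All function and variable names (`kmeans`, `em_impute`, `cluster_then_impute`) are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def em_impute(X, n_iter=30, ridge=1e-6):
    """Fill NaNs with conditional means under a multivariate normal
    whose mean and covariance are estimated by EM (assumed MLE variant)."""
    X = np.array(X, dtype=float)
    n, p = X.shape
    miss = np.isnan(X)
    # Start from column-mean imputation.
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])
    mu = X.mean(axis=0)
    sigma = (X - mu).T @ (X - mu) / n + ridge * np.eye(p)
    for _ in range(n_iter):
        extra = np.zeros((p, p))  # sum of conditional covariances (E-step)
        for i in np.where(miss.any(axis=1))[0]:
            m, o = miss[i], ~miss[i]
            if not o.any():  # nothing observed in this row
                X[i] = mu
                extra += sigma
                continue
            # B = S_mo S_oo^{-1}; S_oo is symmetric, so solve and transpose.
            B = np.linalg.solve(sigma[np.ix_(o, o)], sigma[np.ix_(o, m)]).T
            X[i, m] = mu[m] + B @ (X[i, o] - mu[o])
            extra[np.ix_(m, m)] += sigma[np.ix_(m, m)] - B @ sigma[np.ix_(o, m)]
        # M-step: update mean and covariance from the completed data.
        mu = X.mean(axis=0)
        diff = X - mu
        sigma = (diff.T @ diff + extra) / n + ridge * np.eye(p)
    return X

def cluster_then_impute(X, k):
    """Cluster on fully observed features, impute inside each cluster, merge."""
    complete_cols = ~np.isnan(X).any(axis=0)
    labels = kmeans(X[:, complete_cols], k)
    out = X.copy()
    for j in np.unique(labels):
        out[labels == j] = em_impute(X[labels == j])
    return out

# Synthetic stand-in for the real data: two groups, five features.
rng = np.random.default_rng(0)
n = 300
group = rng.integers(0, 2, size=n)
base = rng.normal(size=(n, 3)) + 4.0 * group[:, None]
dependent = 0.8 * base[:, :2] + rng.normal(scale=0.3, size=(n, 2))
full = np.hstack([base, dependent])

# Step 1: delete values completely at random (MCAR) in columns 3 and 4.
mask = np.zeros_like(full, dtype=bool)
mask[:, [3, 4]] = rng.random((n, 2)) < 0.2
incomplete = np.where(mask, np.nan, full)

# Steps 2 and 3: cluster without the incomplete features, then impute.
imputed = cluster_then_impute(incomplete, k=2)

# Compare against the original values with RMSE on the deleted entries.
rmse = np.sqrt(np.mean((imputed[mask] - full[mask]) ** 2))
mean_filled = np.where(mask, np.nanmean(incomplete, axis=0), full)
rmse_mean = np.sqrt(np.mean((mean_filled[mask] - full[mask]) ** 2))
print(f"cluster + EM RMSE: {rmse:.3f}, column-mean RMSE: {rmse_mean:.3f}")
```

Clustering on the fully observed features first means each EM fit sees a more homogeneous subset of records, which is the motivation the abstract gives for the clustering step. MAR and MNAR patterns would need a different masking rule in step 1, which this sketch does not attempt.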
