Missing value imputation using a fuzzy clustering-based EM approach

Data preprocessing and cleansing play a vital role in data mining by ensuring good data quality. Data-cleansing tasks include imputing missing values, identifying outliers, and identifying and correcting noisy data. In this paper, we present a novel technique called a Fuzzy Expectation Maximization and Fuzzy Clustering-based Missing Value Imputation Framework for Data Pre-processing (FEMI). It imputes numerical and categorical missing values by making an educated guess based on records that are similar to the record having a missing value. To identify a group of similar records and make a guess based on that group, it applies a fuzzy clustering approach and our novel fuzzy expectation maximization algorithm. We evaluate FEMI on eight publicly available natural data sets by comparing its performance with that of five high-quality existing techniques, namely EMI, GkNN, FKMI, SVR, and IBLLS. We use thirty-two types (patterns) of missing values for each data set. Two evaluation criteria, namely root mean squared error and mean absolute error, are used. Our experimental results indicate (according to a confidence interval and $t$-test analysis) that FEMI performs significantly better than EMI, GkNN, FKMI, SVR, and IBLLS.
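The general idea of guessing a missing value from similar records, and of scoring the guess with root mean squared error and mean absolute error, can be illustrated with a minimal sketch. This is not the FEMI algorithm: the abstract does not give the details of the fuzzy expectation maximization step, so the sketch substitutes plain fuzzy c-means and a membership-weighted average of cluster centers, and it handles numerical attributes only. The function names (`fuzzy_c_means`, `impute_record`, `rmse`, `mae`), the fuzzifier `m = 2`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means on complete numerical records X (n x d).

    Illustrative stand-in only; FEMI's own fuzzy EM step is not reproduced here.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # memberships sum to 1 per record
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]   # weighted cluster centers
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)              # avoid division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

def impute_record(record, centers, m=2.0):
    """Fill NaNs in one record with a membership-weighted mix of cluster centers.

    Memberships are computed from the observed attributes only.
    """
    obs = ~np.isnan(record)
    dist = np.linalg.norm(centers[:, obs] - record[obs], axis=1)
    dist = np.fmax(dist, 1e-12)
    inv = dist ** (-2.0 / (m - 1.0))
    u = inv / inv.sum()                          # fuzzy membership of this record
    filled = record.copy()
    filled[~obs] = u @ centers[:, ~obs]
    return filled

def rmse(actual, imputed):
    """Root mean squared error between actual and imputed values."""
    diff = np.asarray(actual, dtype=float) - np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(actual, imputed):
    """Mean absolute error between actual and imputed values."""
    diff = np.asarray(actual, dtype=float) - np.asarray(imputed, dtype=float)
    return float(np.mean(np.abs(diff)))

# Toy data (hypothetical): two well-separated groups of complete records.
X_complete = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                       [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, U = fuzzy_c_means(X_complete, c=2)

# A record whose second attribute is missing; its true value is taken to be 0.5.
filled = impute_record(np.array([0.5, np.nan]), centers)
print(filled, rmse([0.5], [filled[1]]), mae([0.5], [filled[1]]))
```

Because the incomplete record sits close to one group on its observed attribute, almost all of its membership goes to that group's center, so the imputed value is drawn from similar records rather than from the data set as a whole.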
