Fuzzy C-Means clustering of incomplete data based on probabilistic information granules of missing values

Missing values are common in real-world data sets, and the analysis of incomplete data has become an active area of research. In this paper, we address the clustering of incomplete data by introducing prior distribution information about the missing values into the fuzzy clustering algorithm. First, based on the nearest neighbors of each incomplete datum, non-parametric hypothesis testing is employed to identify missing values that follow a Gaussian distribution and to describe them as probabilistic information granules. Second, we propose a novel clustering model in which the probabilistic information granules of missing values are incorporated into Fuzzy C-Means clustering of incomplete data through the maximum likelihood criterion. Third, the clustering model is optimized by a tri-level alternating optimization that uses the method of Lagrange multipliers. The convergence and the time complexity of the clustering algorithm are also discussed. Experiments on both synthetic and real-world data sets demonstrate that the proposed approach clusters incomplete data effectively.
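
To make the first step more concrete, the following is a minimal sketch of how a missing value might be summarized as a Gaussian granule (a mean and a standard deviation) from the nearest neighbors of an incomplete sample. It is an illustration under stated assumptions, not the paper's procedure: the function names `partial_distance` and `gaussian_granule`, the neighbor count `k`, and the use of a rescaled partial distance over shared observed attributes are all hypothetical choices, and the normality test that the abstract describes is omitted here.

```python
import math
from statistics import mean, stdev


def partial_distance(a, b):
    """Euclidean distance over attributes observed in both samples,
    rescaled to the full dimension (assumed distance; None marks missing)."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf  # no common observed attribute: treat as infinitely far
    sq = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(sq * len(a) / len(shared))


def gaussian_granule(data, i, j, k=3):
    """Return (mean, std) of attribute j over the k nearest neighbors of
    sample i that have attribute j observed (hypothetical helper)."""
    candidates = [
        (partial_distance(data[i], row), row[j])
        for idx, row in enumerate(data)
        if idx != i and row[j] is not None
    ]
    candidates.sort(key=lambda t: t[0])
    values = [v for _, v in candidates[:k]]
    if len(values) < 2:
        raise ValueError("need at least two neighbors to estimate a spread")
    # A normality test on `values` would go here before accepting the granule.
    return mean(values), stdev(values)


# Toy data: the last sample is missing its second attribute.
data = [
    [1.0, 2.0],
    [1.1, 2.2],
    [0.9, 1.8],
    [5.0, 9.0],
    [1.0, None],
]
mu, sigma = gaussian_granule(data, i=4, j=1, k=3)
print(f"granule for data[4][1]: mean={mu:.3f}, std={sigma:.3f}")
```

In this toy example the three closest samples supply the values 2.0, 2.2 and 1.8, so the granule is centered near 2.0 with a spread near 0.2, while the distant sample `[5.0, 9.0]` is ignored. In a fuller treatment, such a granule would only be accepted if the neighbor values pass a normality test, and it would then enter the clustering objective rather than being used as a point imputation.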
