Clustering stability-based Evolutionary K-Means

Evolutionary K-Means (EKM), which combines K-Means with a genetic algorithm, solves the initialization problem of K-Means by selecting parameters automatically through the evolution of partitions. Current EKM algorithms usually use the silhouette index as the cluster validity index, and they are effective at finding well-separated clusters. However, their performance on noisy data is often disappointing. Clustering stability-based approaches, on the other hand, are more robust to noise, yet they need an intelligent starting point to find some challenging clusters. It is therefore necessary to combine EKM with clustering stability-based analysis. In this paper, we present a novel EKM algorithm that uses clustering stability to evaluate partitions. We first introduce two weighted aggregated consensus matrices, the positive aggregated consensus matrix (PA) and the negative aggregated consensus matrix (NA), to store the clustering tendency of each pair of instances. Specifically, PA stores the tendency of a pair to share the same label, and NA stores the tendency of a pair to have different labels. Based on these matrices, clusters and partitions can be evaluated from the viewpoint of clustering stability. We then propose a clustering stability-based EKM algorithm, CSEKM, that evolves the partitions and the aggregated matrices simultaneously. To evaluate its performance, we compare it with an EKM algorithm, two consensus clustering algorithms, a clustering stability-based algorithm and a multi-index-based clustering approach. Experimental results on a series of artificial datasets, two simulated datasets and eight UCI datasets suggest that CSEKM is more robust to noise.
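To make the idea of the two aggregated matrices concrete, the following is a minimal Python sketch, not the paper's implementation. The abstract does not specify how partitions are weighted or how stability scores are derived from PA and NA, so the function names `aggregated_consensus` and `pair_stability`, the default weight of 1 per partition, and the ratio PA / (PA + NA) are all assumptions made for illustration.

```python
import numpy as np

def aggregated_consensus(partitions, weights=None):
    """Accumulate pairwise label agreement over several partitions.

    partitions: list of label sequences, all of the same length n.
    weights: per-partition weights (assumed here; defaults to 1 each).
    Returns (PA, NA), two n x n matrices:
      PA[i, j] sums the weights of partitions where i and j share a label,
      NA[i, j] sums the weights of partitions where their labels differ.
    """
    labels = [np.asarray(p) for p in partitions]
    n = labels[0].shape[0]
    if weights is None:
        weights = np.ones(len(labels))
    pa = np.zeros((n, n))
    na = np.zeros((n, n))
    for lab, w in zip(labels, weights):
        same = lab[:, None] == lab[None, :]  # boolean co-membership matrix
        pa += w * same
        na += w * ~same
    return pa, na

def pair_stability(pa, na):
    """Illustrative score: fraction of total weight placing a pair together."""
    total = pa + na
    return np.divide(pa, total, out=np.zeros_like(pa), where=total > 0)

# Three toy partitions of four instances (label names are arbitrary).
partitions = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
pa, na = aggregated_consensus(partitions)
stability = pair_stability(pa, na)
```

In this toy example, instances 0 and 1 share a label in all three partitions, while instances 0 and 3 never do, so the pair (0, 1) has a stability of 1 and the pair (0, 3) has a stability of 0 under the assumed score.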
