A new consensus function based on dual-similarity measurements for clustering ensemble

A clustering ensemble is an unsupervised learning method that combines a number of partitions in order to produce a better clustering result. In this paper, we propose a clustering ensemble algorithm named Dual-Similarity Clustering Ensemble (DSCE). The core of our ensemble is a consensus function, which consists of three stages. The first stage transforms the initial clusters into a binary representation. The second measures the similarity between the initial clusters and merges the most similar ones. The third identifies candidate clusters, which contain only certain objects, and calculates their quality. The final clustering result is produced by an iterative process that assigns each uncertain object to the cluster on whose quality it has the minimum effect. The number of clusters in the final clustering result converges to a stable value derived from the generated members, in contrast to most existing methods, which require the user to provide the number of clusters in advance. Experimental results on real datasets indicate that our method is statistically significantly better than other state-of-the-art clustering ensemble methods, including the CO and DICLENS algorithms.
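The three stages can be illustrated with a minimal Python sketch. It is not the DSCE implementation: the abstract does not specify the similarity measures, the merge criterion, or the cluster quality function, so the sketch assumes Jaccard similarity between clusters, a hypothetical merge `threshold` of 0.5, and a simple vote count in place of the quality-based assignment of uncertain objects. All function names are illustrative.

```python
import numpy as np

def to_binary(partitions):
    """Stage 1: one boolean row per initial cluster, one column per object."""
    rows = []
    for labels in partitions:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            rows.append(labels == c)
    return np.array(rows, dtype=bool)

def jaccard(a, b):
    """Assumed cluster similarity: |a AND b| / |a OR b| on boolean vectors."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def merge_clusters(binary, threshold=0.5):
    """Stage 2: repeatedly merge the most similar pair of clusters.

    Each merged cluster keeps a per-object vote count. Merging stops when
    the best similarity drops below the (assumed) threshold, so the number
    of clusters is not supplied by the user.
    """
    votes = [row.astype(int) for row in binary]
    while len(votes) > 1:
        best, pair = -1.0, None
        for i in range(len(votes)):
            for j in range(i + 1, len(votes)):
                s = jaccard(votes[i] > 0, votes[j] > 0)
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        votes[i] = votes[i] + votes[j]
        del votes[j]
    return np.array(votes)

def split_certain(votes):
    """Stage 3: objects claimed by exactly one merged cluster are certain."""
    support = (votes > 0).sum(axis=0)
    certain = support == 1
    labels = np.where(certain, votes.argmax(axis=0), -1)  # -1 marks uncertain
    return labels, ~certain

def assign_uncertain(votes, labels, uncertain):
    """Stand-in for the quality-based step: give each uncertain object to
    the candidate cluster that received the most votes for it."""
    labels = labels.copy()
    labels[uncertain] = votes[:, uncertain].argmax(axis=0)
    return labels

def consensus(partitions, threshold=0.5):
    votes = merge_clusters(to_binary(partitions), threshold)
    labels, uncertain = split_certain(votes)
    return assign_uncertain(votes, labels, uncertain), uncertain

# Three member partitions of six objects; object 2 is placed inconsistently.
partitions = [[0, 0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1, 1],
              [1, 1, 1, 0, 0, 0]]
final_labels, uncertain = consensus(partitions)
print(final_labels)  # [0 0 0 1 1 1]
print(uncertain)     # only object 2 is flagged as uncertain
```

In this toy input the six initial clusters merge into two candidate clusters. Object 2 is the only object claimed by both, so it is treated as uncertain and then assigned by majority vote. A faithful implementation would replace the vote count with the cluster quality measure defined in the paper.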
