Clustering ensemble method

A clustering ensemble combines multiple clustering models to produce a result that is more consistent and of higher quality than that of any individual clustering algorithm. In this paper, we propose a clustering ensemble algorithm with a novel consensus function, named Adaptive Clustering Ensemble. It employs two similarity measures, cluster similarity and a newly defined membership similarity, and works adaptively through three stages. The first stage transforms the initial clusters into a binary representation. The second stage aggregates the initial clusters that are most similar according to the cluster similarity measure; this aggregation repeats adaptively until the intended candidate clusters are produced. The third stage further refines the clusters by dealing with uncertain objects, producing an improved final clustering result with the desired number of clusters. The proposed method is tested on various real-world benchmark datasets, and its performance is compared with that of other state-of-the-art clustering ensemble methods, including the Co-association method and the Meta-Clustering Algorithm. The experimental results indicate that, on average, our method is more accurate and more efficient.
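
The three stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the cluster similarity or the membership similarity, so this sketch assumes a Jaccard-style overlap for the former and a simple averaged membership score with a hypothetical `threshold` for flagging uncertain objects. All function names are illustrative.

```python
import numpy as np

def binary_membership(labelings):
    """Stage 1: encode every cluster of every base partition as a 0/1 row."""
    rows = []
    for labels in labelings:
        labels = np.asarray(labels)
        for c in np.unique(labels):
            rows.append((labels == c).astype(float))
    return np.vstack(rows)  # shape: (total number of clusters, n objects)

def overlap_similarity(a, b):
    """Assumed cluster similarity: Jaccard-style overlap of two membership rows."""
    union = np.maximum(a, b).sum()
    return np.minimum(a, b).sum() / union if union else 0.0

def merge_most_similar(rows, k):
    """Stage 2: repeatedly merge the two most similar clusters until k remain."""
    groups = [[r] for r in rows]
    while len(groups) > k:
        means = [np.mean(g, axis=0) for g in groups]
        best, pair = -1.0, (0, 1)
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                s = overlap_similarity(means[i], means[j])
                if s > best:
                    best, pair = s, (i, j)
        i, j = pair  # i < j, so popping j leaves index i valid
        groups[i] = groups[i] + groups.pop(j)
    # Each row holds, per object, the fraction of merged clusters containing it.
    return np.vstack([np.mean(g, axis=0) for g in groups])

def assign_objects(membership, threshold=0.8):
    """Stage 3 (simplified): flag objects whose best score is below the
    assumed threshold as uncertain, and assign each object to its best cluster."""
    labels = membership.argmax(axis=0)
    uncertain = membership.max(axis=0) < threshold
    return labels, uncertain

# Three base partitions of six objects; the third disagrees on object 2.
partitions = [[0, 0, 0, 1, 1, 1],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 1]]
candidates = merge_most_similar(binary_membership(partitions), k=2)
labels, uncertain = assign_objects(candidates)
```

In this toy run the first three objects end up in one cluster and the last three in the other, and only object 2 is flagged as uncertain, because one of the three base partitions places it differently.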
