Clustering ensembles: models of consensus and weak partitions

Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of the overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.
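The mixture-model view of consensus described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each object is represented by the vector of cluster labels it received from the H component partitions, that these labels are conditionally independent given the consensus cluster, and that no labels are missing. The function name `em_consensus`, its parameters, and the smoothing constant `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def em_consensus(labels, n_clusters, n_iter=100, seed=0, eps=1e-9):
    """EM for a finite mixture of multinomials over label vectors (sketch).

    labels: (N, H) integer array; column j holds the labels assigned by
            partition j, coded 0..K_j-1. Missing labels are NOT handled.
    Returns a length-N array of consensus cluster indices.
    """
    labels = np.asarray(labels)
    n, h = labels.shape
    rng = np.random.default_rng(seed)

    # One-hot encode every partition: onehot[j] has shape (N, K_j).
    onehot = [np.eye(labels[:, j].max() + 1)[labels[:, j]] for j in range(h)]

    # Random soft initial responsibilities; each row sums to 1.
    resp = rng.dirichlet(np.ones(n_clusters), size=n)

    for _ in range(n_iter):
        # M-step: mixing weights and per-partition multinomial parameters.
        alpha = resp.mean(axis=0)                      # (M,)
        thetas = []
        for oh in onehot:
            counts = resp.T @ oh + eps                 # (M, K_j), smoothed
            thetas.append(counts / counts.sum(axis=1, keepdims=True))

        # E-step: log posterior of each component for each object.
        log_r = np.tile(np.log(alpha + eps), (n, 1))   # (N, M)
        for oh, theta in zip(onehot, thetas):
            log_r += oh @ np.log(theta).T              # picks log theta[m, y_ij]
        log_r -= log_r.max(axis=1, keepdims=True)      # numerical stability
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

    # Consensus label = most probable mixture component.
    return resp.argmax(axis=1)
```

Calling `em_consensus(labels, n_clusters=2)` on an array whose columns are different partitions of the same objects returns one combined partition. Because the component partitions enter only through their label values, permuted label names across partitions do not need to be aligned beforehand. The consensus indices themselves are arbitrary and depend on the random initialization.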
