Similarity-based Combination of Multiple Clusterings

Consensus clustering combines multiple clusterings of a common set of objects into a single consolidated partition. After introducing a distribution-based view of partitions, we propose a series of entropy-based distance functions for comparing partitions. Given a set of candidate partitions, consensus clustering is then formalized as an optimization problem: find the centroid partition with the smallest distance to that set. In addition to directly selecting the local centroid from among the candidates, we present two combining methods for the global centroid, based on a new similarity determined by the whole candidate set. The centroid partition is likely to rank in the top or middle of the candidates in terms of closeness to the true partition. Finally, we evaluate the effectiveness of the approach on both artificial and real datasets, with candidates drawn from either the full space or a subspace.
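
To make the local-centroid idea concrete, the following is a minimal sketch rather than the method of the paper. The abstract does not state its distance functions, so the sketch assumes variation of information, VI(A, B) = 2H(A, B) - H(A) - H(B), as a stand-in entropy-based distance. The function names (`entropy`, `variation_of_information`, `local_centroid`) are hypothetical. Each partition is a list of cluster labels, one per object, and the local centroid is taken to be the candidate with the smallest summed distance to all candidates.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def variation_of_information(a, b):
    """Assumed stand-in distance: VI(A, B) = 2 H(A, B) - H(A) - H(B)."""
    if len(a) != len(b):
        raise ValueError("partitions must label the same set of objects")
    joint = entropy(list(zip(a, b)))  # entropy of the joint label distribution
    return 2 * joint - entropy(a) - entropy(b)


def local_centroid(candidates):
    """Return (index, totals): the candidate with the smallest summed distance."""
    totals = [
        sum(variation_of_information(p, q) for q in candidates)
        for p in candidates
    ]
    best = min(range(len(candidates)), key=totals.__getitem__)
    return best, totals


if __name__ == "__main__":
    # Illustrative candidate partitions of six objects (made-up data).
    candidates = [
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],  # same partition as the first, labels swapped
        [0, 0, 1, 1, 1, 1],
        [0, 1, 0, 1, 0, 1],
    ]
    best, totals = local_centroid(candidates)
    print("local centroid index:", best)
    print("summed distances:", [round(t, 4) for t in totals])
```

Because the distance depends only on the label distributions, two candidates that differ only by a renaming of clusters are at distance zero from each other, so no label alignment step is needed before comparing partitions.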
