Paper information - Consensus clustering

Consensus clustering

We address the consensus clustering problem of combining multiple partitions of a set of objects into a single consolidated partition. The input is a set of cluster labelings; we have no access to the original data or to the clustering algorithms that produced these partitions. After introducing a distribution-based view of partitions, we propose a series of entropy-based distance functions for comparing partitions. Given a set of candidate partitions, consensus clustering is then formalized as an optimization problem: find the centroid partition with the smallest distance to that set. In addition to directly selecting the local centroid among the candidates, we present two combining methods based on similarity-based graph partitioning. Under certain conditions, the centroid partition is likely to rank in the top or middle of the candidates in terms of closeness to the true partition. Finally, we evaluate the effectiveness of the approach on both artificial and real datasets, with candidates drawn from either the full space or subspaces.
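As an illustration of the centroid formulation, the following is a minimal sketch rather than the authors' method. It assumes the variation of information, VI(A, B) = 2H(A, B) - H(A) - H(B), as a stand-in for the entropy-based distances the abstract mentions, and it picks the candidate whose summed distance to all candidates is smallest. The function names `entropy`, `vi_distance`, and `local_centroid` are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def vi_distance(a, b):
    """Variation of information between two labelings of the same objects.

    Assumption: VI is used here only as one example of an entropy-based
    partition distance; the paper's own distance functions may differ.
    """
    joint = entropy(list(zip(a, b)))  # entropy of the joint labeling
    return 2 * joint - entropy(a) - entropy(b)


def local_centroid(partitions):
    """Index of the candidate with the smallest total distance to the set."""
    def total(i):
        return sum(vi_distance(partitions[i], q) for q in partitions)
    return min(range(len(partitions)), key=total)


if __name__ == "__main__":
    candidates = [
        [0, 0, 1, 1],
        [1, 1, 0, 0],  # same partition as the first, labels permuted
        [0, 0, 1, 2],
        [0, 1, 0, 1],
        [0, 0, 1, 1],
    ]
    best = local_centroid(candidates)
    print("local centroid:", candidates[best])
```

Because the distance depends only on the label distributions, it is invariant to relabeling clusters, so the first two candidates above are at distance zero from each other.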

Sam Yuan Sung | Tianming Hu
