Stability-Based Validation of Clustering Solutions

Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract natural group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in real-world problems.
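
The procedure outlined above can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation. It assumes k-means as the clustering algorithm, a nearest-centroid rule as the classifier trained on the first sample, and a brute-force search over label permutations to align the two labelings (practical only for small k). All function names are hypothetical, and no normalization across different numbers of clusters is attempted, which a fair comparison between values of k may require.

```python
import itertools
import random


def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-means on a list of tuples; returns k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # An empty group keeps its previous centroid.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids


def disagreement(labels_a, labels_b, k):
    """Smallest fraction of mismatched labels over all relabelings of labels_a."""
    n = len(labels_a)
    best = n
    for perm in itertools.permutations(range(k)):
        mismatches = sum(1 for a, b in zip(labels_a, labels_b) if perm[a] != b)
        best = min(best, mismatches)
    return best / n


def instability(points, k, rng, splits=10):
    """Average disagreement between predicted and clustered labels on a second sample."""
    total = 0.0
    for _ in range(splits):
        shuffled = points[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        c_first = kmeans(first, k, rng)
        c_second = kmeans(second, k, rng)
        # Labels the first-sample solution predicts for the second sample.
        predicted = [nearest(p, c_first) for p in second]
        # Labels obtained by clustering the second sample directly.
        clustered = [nearest(p, c_second) for p in second]
        total += disagreement(predicted, clustered, k)
    return total / splits


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data (an assumption for illustration): three Gaussian blobs in 2D.
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    data = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(30)]
    for k in range(2, 5):
        print(k, instability(data, k, rng))
```

Under this reading, the candidate k with the lowest score would be preferred. The brute-force permutation search could be replaced by an assignment solver when k grows beyond a handful of clusters.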
