A Stability Based Method for Discovering Structure in Clustered Data

We present a method for visually and quantitatively assessing the presence of structure in clustered data. The method measures how stable clustering solutions remain when the data set is perturbed. Stability is characterized by the distribution of pairwise similarities between clusterings obtained from subsamples of the data: high pairwise similarities indicate a stable clustering pattern. The method can be used with any clustering algorithm. It provides a principled way to choose the number of clusters and can also detect the absence of structure in the data. We show results on artificial and microarray data using a hierarchical clustering algorithm.
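The procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation, and several details are assumptions not stated in the abstract: the subsampling fraction (0.8), the number of subsample pairs (20), the use of a Jaccard coefficient over co-clustered point pairs as the similarity measure, and a naive single-linkage routine standing in for the hierarchical clustering algorithm. All function names are hypothetical.

```python
import random
from itertools import combinations


def single_linkage(points, k):
    """Naive agglomerative single-linkage clustering into k clusters.

    Stand-in for "a hierarchical clustering algorithm"; any clustering
    routine that returns one label per point could be used instead.
    """
    clusters = [[i] for i in range(len(points))]

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b]))

    while len(clusters) > k:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(sqdist(a, b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge the two closest clusters
        del clusters[j]                  # j > i, so index i stays valid

    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels


def jaccard(labels_a, labels_b, shared):
    """Jaccard similarity of two labelings over pairs of shared points.

    Assumed similarity measure: pairs co-clustered in both labelings,
    divided by pairs co-clustered in at least one.
    """
    both = only_a = only_b = 0
    for p, q in combinations(shared, 2):
        same_a = labels_a[p] == labels_a[q]
        same_b = labels_b[p] == labels_b[q]
        if same_a and same_b:
            both += 1
        elif same_a:
            only_a += 1
        elif same_b:
            only_b += 1
    total = both + only_a + only_b
    return both / total if total else 1.0


def stability_scores(points, k, fraction=0.8, n_pairs=20, seed=0):
    """Similarities between clusterings of pairs of random subsamples."""
    rng = random.Random(seed)
    n = len(points)
    m = int(fraction * n)
    scores = []
    for _ in range(n_pairs):
        labelings = []
        for _ in range(2):
            idx = rng.sample(range(n), m)
            labels = single_linkage([points[i] for i in idx], k)
            labelings.append(dict(zip(idx, labels)))
        # Compare the two clusterings only on points present in both subsamples.
        shared = sorted(set(labelings[0]) & set(labelings[1]))
        scores.append(jaccard(labelings[0], labelings[1], shared))
    return scores


# Toy data (assumed for illustration): two well-separated Gaussian blobs.
_rng = random.Random(1)
data = [(_rng.gauss(0, 0.5), _rng.gauss(0, 0.5)) for _ in range(15)]
data += [(_rng.gauss(10, 0.5), _rng.gauss(10, 0.5)) for _ in range(15)]

for k in (2, 3, 4):
    scores = stability_scores(data, k)
    print(k, round(sum(scores) / len(scores), 3))
```

Under these assumptions, a candidate number of clusters whose similarity scores concentrate near 1 would be read as stable, while scores spread over lower values for every candidate would suggest a lack of structure. Inspecting the full distribution of scores for each k, rather than only the mean, is closer to the visual assessment the abstract describes.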
