Resampling Method for Unsupervised Estimation of Cluster Validity

We introduce a method for validation of results obtained by clustering analysis of data. The method is based on resampling the available data. A figure of merit that measures the stability of clustering solutions against resampling is introduced. Clusters that are stable against resampling give rise to local maxima of this figure of merit. This is presented first for a one-dimensional data set, for which an analytic approximation for the figure of merit is derived and compared with numerical measurements. Next, the applicability of the method is demonstrated for higher-dimensional data, including gene microarray expression data.

[1]  Michael Ruogu Zhang,et al.  Super-paramagnetic clustering of yeast gene expression profiles , 1999, physics/9911038.

[2]  P. Good Resampling Methods , 1999, Birkhäuser Boston.

[3]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[4]  Anil K. Jain,et al.  Bootstrap technique in cluster analysis , 1987, Pattern Recognit..

[5]  Eytan Domany,et al.  Superparamagnetic Clustering of Data , 1996 .

[6]  Eytan Domany,et al.  Superparamagnetic clustering of data — The definitive solution of an ill-posed problem , 1999 .

[7]  Rose,et al.  Statistical mechanics and phase transitions in clustering. , 1990, Physical review letters.

[8]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[9]  Padhraic Smyth,et al.  Clustering Using Monte Carlo Cross-Validation , 1996, KDD.

[10]  H. Bock On some significance tests in cluster analysis , 1985 .

[11]  James C. Bezdek,et al.  On cluster validity for the fuzzy c-means model , 1995, IEEE Trans. Fuzzy Syst..

[12]  Michael P. Windham,et al.  Cluster Validity for the Fuzzy c-Means Clustering Algorithrm , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  James L. Winkler,et al.  Accessing Genetic Information with High-Density DNA Arrays , 1996, Science.

[14]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[15]  James C. Bezdek,et al.  Cluster validation with generalized Dunn's indices , 1995, Proceedings 1995 Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems.

[16]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[17]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[18]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[19]  M. P. Windham,et al.  Information-Based Validity Functionals for Mixture Analysis , 1994 .

[20]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[21]  J. C. Dunn,et al.  A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters , 1973 .

[22]  Donald W. Bouldin,et al.  A Cluster Separation Measure , 1979, IEEE Transactions on Pattern Analysis and Machine Intelligence.