Applications of Resampling Methods to Estimate the Number of Clusters and to Improve the Accuracy of

The burgeoning field of genomics, and in particular microarray experiments, has revived interest in both discriminant and cluster analysis by raising new methodological and computational challenges. The present paper discusses applications of resampling methods to problems in cluster analysis. A resampling method, known as bagging in discriminant analysis, is applied to increase clustering accuracy and to assess the confidence of cluster assignments for individual observations. A novel prediction-based resampling method is also proposed to estimate the number of clusters, if any, in a dataset. The performance of the proposed and existing methods is compared using simulated data and gene expression data from four recently published cancer microarray studies.
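
The idea of carrying bagging over to clustering can be illustrated with a small toy example. The sketch below is only one plausible reading of that idea and is not taken from the paper: the base clusterer (plain k-means with farthest-first initialisation), the alignment of bootstrap labels to a reference clustering by trying all label permutations, the vote-share "confidence", and all function names are assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iters=20):
    """Plain Lloyd's k-means; initial centroids chosen farthest-first (assumed choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def bagged_clustering(points, k, n_boot=25, seed=0):
    """Cluster bootstrap samples, align labels to a reference clustering, and vote."""
    rng = random.Random(seed)
    n = len(points)
    reference = kmeans_labels(points, k)  # clustering of the full data
    votes = [Counter() for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        boot = kmeans_labels([points[i] for i in idx], k)
        # Cluster labels are arbitrary, so pick the relabelling of the bootstrap
        # clusters that agrees most often with the reference clustering.
        best = max(
            itertools.permutations(range(k)),
            key=lambda perm: sum(
                perm[b] == reference[i] for i, b in zip(idx, boot)
            ),
        )
        # One vote per observation per bootstrap sample it appears in.
        for i, b in dict(zip(idx, boot)).items():
            votes[i][best[b]] += 1
    labels, confidence = [], []
    for i in range(n):
        if votes[i]:
            label, count = votes[i].most_common(1)[0]
            labels.append(label)
            confidence.append(count / sum(votes[i].values()))
        else:  # observation never drawn: fall back to the reference label
            labels.append(reference[i])
            confidence.append(0.0)
    return labels, confidence


# Toy data: two well-separated groups of ten 2-D observations each.
group_a = [(0.1 * i, 0.1 * (i % 3)) for i in range(10)]
group_b = [(10 + 0.1 * i, 10 + 0.1 * (i % 3)) for i in range(10)]
points = group_a + group_b
labels, confidence = bagged_clustering(points, k=2)
```

Here `labels` holds the majority-vote cluster for each observation and `confidence` the share of votes behind it, which gives a rough per-observation measure of how stable an assignment is under resampling. Trying every label permutation is only practical for a small number of clusters.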
