Estimating Number of Clusters Based on a General Similarity Matrix with Application to Microarray Data

Many clustering methods require the number of clusters in a data set to be specified a priori, and several methods for estimating this number have been developed. However, selecting the number of clusters is widely recognized as a difficult, open problem, and there is a need for methods that can shed light on specific aspects of the data. This paper adopts a model for clustering based on a specific structure for a similarity matrix. Publicly available gene expression data sets are analyzed to illustrate the method, and its performance is assessed by simulation.
