A Method to Identify Significant Clusters in Gene Expression Data

Clustering algorithms have been widely applied to gene expression data. For both hierarchical and partitioning clustering algorithms, selecting the number of significant clusters is an important problem, and many methods have been proposed. Existing methods for selecting the number of clusters tend to find only the global patterns in the data (e.g., the over- and under-expressed genes). We have noted the need for a better method in the gene expression context, where small, biologically meaningful clusters can be difficult to identify. In this paper, we define a new criterion, Mean Split Silhouette (MSS), which is a measure of cluster heterogeneity. We propose to choose the number of clusters as the minimizer of MSS. In this way, the number of significant clusters is defined as that which produces the most homogeneous clusters. The power of this method compared to existing methods is demonstrated on simulated microarray data. The minimum MSS method is an example of a general approach that can be applied to any clustering routine with any global criterion.
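
The selection rule described above can be illustrated with a small Python sketch. The abstract only states that MSS measures cluster heterogeneity and that the number of clusters is its minimizer, so the details here are assumptions: each cluster is re-clustered into 2 to `max_split` sub-clusters, the best average silhouette of those splits is taken as that cluster's split silhouette, and MSS is the mean over clusters. The brute-force `k_medoids` routine, the function names, the `max_split` parameter, the score of 0 for clusters too small to split, and the toy data are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import dist
from statistics import mean


def k_medoids(points, k):
    """Brute-force k-medoids labels (illustrative; feasible only for tiny inputs)."""
    def cost(medoids):
        return sum(min(dist(p, points[m]) for m in medoids) for p in points)
    best = min(combinations(range(len(points)), k), key=cost)
    return [min(range(k), key=lambda j: dist(p, points[best[j]])) for p in points]


def avg_silhouette(points, labels):
    """Average silhouette width; singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # Mean distance to the other members (the self-distance contributes 0).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to any other cluster.
        b = min(mean(dist(p, q) for q in other)
                for c, other in clusters.items() if c != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def split_silhouette(cluster, max_split=3):
    """Best average silhouette from re-clustering one cluster into 2..max_split parts."""
    top = min(max_split, len(cluster) - 1)
    if top < 2:  # assumption: clusters too small to split count as homogeneous
        return 0.0
    return max(avg_silhouette(cluster, k_medoids(cluster, l))
               for l in range(2, top + 1))


def mean_split_silhouette(points, k, max_split=3):
    """Assumed MSS: mean split silhouette over the k clusters of a k-medoids fit."""
    labels = k_medoids(points, k)
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    return mean(split_silhouette(c, max_split) for c in clusters)


def square(x, y):
    """Four corner points of a unit square (toy stand-in for one tight cluster)."""
    return [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]


# Toy data: three tight groups, two of them relatively close together.
points = square(0, 0) + square(10, 0) + square(50, 0)
scores = {k: mean_split_silhouette(points, k) for k in (2, 3)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

On this toy data, k = 2 merges the two nearby groups into one cluster that splits cleanly again, which raises its split silhouette, so k = 3 has the lower MSS. Only k = 2 and k = 3 are compared because, with twelve points, larger k would produce clusters too small to split, and the zero-score convention assumed above would then favor them artificially.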
