Comparisons and validation of statistical clustering techniques for microarray gene expression data

MOTIVATION: With the advent of microarray chip technology, large data sets are emerging that contain the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While hierarchical clustering (UPGMA) with correlation 'distance' has been the most common choice in microarray studies, many more clustering algorithms are available in the pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm for grouping genes based on their expression profiles.

RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performance on a well-known, publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of the six clustering methods with these validation measures. While the 'best' method depends on the exact validation strategy and the number of clusters used, overall Diana appears to be a solid performer. Interestingly, the performances of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be at opposite extremes, depending on which validation measure one employs. We also show that the group means produced by Diana are the closest to, and those produced by UPGMA the farthest from, a model profile based on a set of hand-picked genes.

AVAILABILITY: S+ codes for the partial least squares based clustering and for calculating the validation measures are available from the authors upon request. All other clustering methods considered have S+ implementations in the library MASS. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation
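
For readers unfamiliar with the baseline method named above, the following is a minimal sketch of hierarchical clustering (UPGMA, i.e. average linkage) with correlation 'distance', taken here to be one minus the Pearson correlation between two expression profiles. It is not the authors' S+ code. The function names, the toy profiles, and the choice of two clusters are assumptions made purely for illustration.

```python
from itertools import combinations
from math import sqrt


def correlation_distance(x, y):
    """Return 1 - Pearson correlation between two expression profiles.

    Assumes neither profile is constant (a constant profile has zero
    variance and would cause a division by zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)


def upgma(profiles, k):
    """Merge clusters by average linkage until only k clusters remain.

    Returns a list of clusters, each a list of row indices into profiles.
    """
    n = len(profiles)
    # Pairwise distances between individual genes, keyed by (i, j) with i < j.
    dist = {
        (i, j): correlation_distance(profiles[i], profiles[j])
        for i, j in combinations(range(n), 2)
    }

    def average_linkage(a, b):
        # UPGMA: unweighted mean of all between-cluster pairwise distances.
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        # Find the closest pair of clusters (p < q) and merge q into p.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pq: average_linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Hypothetical toy data: rows are genes, columns are time points.
profiles = [
    [0.0, 1.0, 2.0, 3.0],  # rising
    [1.0, 2.0, 3.0, 5.0],  # rising
    [3.0, 2.0, 1.0, 0.0],  # falling
    [5.0, 3.0, 2.0, 1.0],  # falling
]
clusters = upgma(profiles, 2)
print(clusters)
```

Because correlation 'distance' ignores the overall level and scale of a profile, genes are grouped by the shape of their temporal pattern rather than by absolute expression level, which is the grouping criterion the motivation describes.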
