Scoring clustering solutions by their biological relevance

MOTIVATION: A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering gene expression data into homogeneous groups has been shown to be instrumental in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on clustering algorithms for gene expression analysis, very few works have addressed the systematic comparison and evaluation of clustering results. Typically, different clustering algorithms yield different clustering solutions on the same data, and there is no agreed-upon guideline for choosing among them.

RESULTS: We developed a novel, statistically based method for assessing a clustering solution according to prior biological knowledge. Our method can be used to compare different clustering solutions or to optimize the parameters of a clustering algorithm. The method projects vectors of biological attributes of the clustered elements onto the real line such that the ratio of the between-group to the within-group variance estimators is maximized. The projected data are then scored using a non-parametric analysis of variance test, and the score's confidence is evaluated. We validate our approach using simulated data and show that our scoring method outperforms several extant methods, including the separation-to-homogeneity ratio and the silhouette measure. We apply our method to evaluate the results of several clustering methods on yeast cell-cycle gene expression data.

AVAILABILITY: The software is available from the authors upon request.
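The two scoring steps described in the abstract (projection onto the real line, then a non-parametric analysis of variance) can be illustrated with a minimal sketch. It is not the authors' software, and it rests on several assumptions: the projection is computed as Fisher's linear discriminant, which is one standard way to maximize the between-group to within-group variance ratio; the Kruskal-Wallis H statistic stands in for the non-parametric test, which the abstract does not name; and the function names `fisher_projection`, `kruskal_wallis_h`, and `score_clustering` are hypothetical. The confidence evaluation is omitted because the abstract does not say how it is done.

```python
import numpy as np

def fisher_projection(attributes, labels):
    """Unit direction w maximizing between/within scatter of attributes @ w."""
    X = np.asarray(attributes, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_b = np.zeros((d, d))  # between-group scatter
    S_w = np.zeros((d, d))  # within-group scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        diff = (Xc.mean(axis=0) - overall_mean)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
        centered = Xc - Xc.mean(axis=0)
        S_w += centered.T @ centered
    # Assumption: a small ridge term keeps S_w invertible.
    S_w += 1e-8 * np.eye(d)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)

def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic, without tie correction.

    Projected values are continuous, so ties are assumed not to occur.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    n = len(values)
    ranks = np.empty(n)
    ranks[np.argsort(values)] = np.arange(1, n + 1)
    total = 0.0
    for c in np.unique(labels):
        r = ranks[labels == c]
        total += r.sum() ** 2 / len(r)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

def score_clustering(attributes, labels):
    """Project the attribute vectors, then score the projected values."""
    w = fisher_projection(attributes, labels)
    return kruskal_wallis_h(np.asarray(attributes, dtype=float) @ w, labels)

# Simulated attributes: three groups of 30 elements whose means differ.
rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 4)) + 3.0 * true_labels[:, None]
shuffled_labels = rng.permutation(true_labels)

print("score, true grouping:    ", score_clustering(X, true_labels))
print("score, shuffled grouping:", score_clustering(X, shuffled_labels))
```

Because the projection is fitted to whichever labels are supplied, even a shuffled grouping receives a somewhat inflated score, which is one reason a confidence estimate for the score is needed when comparing solutions.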
