Judging the quality of gene expression-based clustering methods using gene annotation.

We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. From a study of various publicly available expression data sets, we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, at the optimal choice of cluster number no method outperforms Euclidean distance for ratio-based measurements or Pearson distance for non-ratio-based measurements. We show the self-organizing map approach to be best for both measurement types at higher numbers of clusters. Single- and average-linkage hierarchical clustering tend to produce clusters that score worse than random.
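A figure of merit of this kind can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it computes the mutual information between a cluster assignment and each gene attribute, sums it over attributes, and expresses the total as a z-score against randomly shuffled cluster labels. The function names, the use of a simple sum over attributes, the shuffling null model, and the number of shuffles are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def mutual_information(clusters, attribute):
    """Mutual information (in bits) between two equal-length label sequences."""
    n = len(clusters)
    joint = Counter(zip(clusters, attribute))
    cluster_counts = Counter(clusters)
    attribute_counts = Counter(attribute)
    mi = 0.0
    for (c, a), count in joint.items():
        p_joint = count / n
        # p(c, a) / (p(c) * p(a)) simplifies to count * n / (count_c * count_a)
        ratio = count * n / (cluster_counts[c] * attribute_counts[a])
        mi += p_joint * math.log2(ratio)
    return mi


def total_mi(clusters, annotations):
    """Sum of MI over attributes (assumption: attributes are simply summed)."""
    return sum(mutual_information(clusters, attr) for attr in annotations)


def z_score(clusters, annotations, n_shuffles=200, seed=0):
    """Compare observed total MI with a null built from shuffled cluster labels.

    Hypothetical null model: gene-to-cluster assignments are permuted, which
    keeps cluster sizes fixed while breaking any link to the annotations.
    """
    rng = random.Random(seed)
    observed = total_mi(clusters, annotations)
    shuffled = list(clusters)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null.append(total_mi(shuffled, annotations))
    mean = sum(null) / len(null)
    variance = sum((x - mean) ** 2 for x in null) / (len(null) - 1)
    std = math.sqrt(variance)
    return (observed - mean) / std if std > 0 else 0.0


if __name__ == "__main__":
    # Toy data: 20 genes in two clusters, one binary attribute per gene.
    clusters = [0] * 10 + [1] * 10
    annotations = [[1] * 10 + [0] * 10]  # attribute matches the clustering
    print("total MI (bits):", total_mi(clusters, annotations))
    print("z-score vs shuffled labels:", z_score(clusters, annotations))
```

Under these assumptions, a clustering whose membership tracks the annotations scores well above the shuffled baseline, while a clustering unrelated to the annotations scores near zero. Scoring the same data at several cluster numbers is one way to look for the enrichment peak described above.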
