Comparing algorithms for clustering of expression data: how to assess gene clusters.

Clustering is commonly used to search for groups of similarly expressed genes in mRNA expression data. There are many clustering algorithms, and each will usually produce different results on the same data. Without additional evaluation, it is difficult to determine which solutions are better. In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures consistency with experimental data. Each index on its own suggests an effective set of models, but only the combination of both can pinpoint the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our methods on several popular clustering algorithms as well as on clustering algorithms specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.
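As a rough illustration of the two-index idea, the sketch below scores candidate clusterings with an internal and an external index and then combines them. It is not the chapter's actual method: the internal index here is a BIC-style two-part code length under an assumed spherical Gaussian model, the external index is an assumed Jaccard overlap between co-clustered gene pairs and known functional links, and the combination rule (shortlist by code length, then pick the best link agreement) is hypothetical. The function names, the toy expression profiles, and the link list are all invented for the example.

```python
import math
from itertools import combinations


def description_length(points, labels):
    """BIC-style two-part code length; lower is better.

    Assumption: spherical Gaussian clusters with one shared variance.
    Additive constants that do not depend on the clustering are dropped.
    """
    n, d = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    rss = 0.0  # residual sum of squares around the cluster centroids
    for members in clusters.values():
        centroid = [sum(col) / len(members) for col in zip(*members)]
        rss += sum((x - m) ** 2 for p in members for x, m in zip(p, centroid))
    variance = max(rss / (n * d), 1e-12)  # guard against log(0)
    data_cost = 0.5 * n * d * math.log(variance)
    model_cost = 0.5 * len(clusters) * d * math.log(n)
    return data_cost + model_cost


def link_agreement(labels, known_links):
    """Jaccard overlap between co-clustered pairs and known linked pairs."""
    co_clustered = {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }
    links = {tuple(sorted(pair)) for pair in known_links}
    union = co_clustered | links
    return len(co_clustered & links) / len(union) if union else 0.0


# Toy data (invented): six genes with two-condition expression profiles.
points = [(0.0, 0.1), (0.1, 0.0), (0.0, 0.0),
          (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
# Hypothetical outputs of three clustering runs on the same genes.
candidates = {
    "one_cluster": [0, 0, 0, 0, 0, 0],
    "two_clusters": [0, 0, 0, 1, 1, 1],
    "three_clusters": [0, 0, 1, 2, 2, 2],
}
# Hypothetical functional links, given as pairs of gene indices.
known_links = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]

# Internal index proposes a shortlist; external index picks among it.
shortlist = sorted(
    candidates, key=lambda name: description_length(points, candidates[name])
)[:2]
best = max(
    shortlist, key=lambda name: link_agreement(candidates[name], known_links)
)
print(shortlist, best)
```

The shortlist size of two is arbitrary. In practice the candidate labelings would come from running different algorithms or parameter settings on the same expression matrix, and the known links from an independent experimental source.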
