Discovering statistically significant biclusters in gene expression data

In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method for detecting significant biclusters in large expression datasets. Our approach combines graph-theoretic techniques with statistical modelling of the data. Under plausible assumptions, our algorithm runs in polynomial time and is guaranteed to find the most significant biclusters. We tested our method on a collection of yeast expression profiles and on a human cancer dataset. Cross-validation results show high specificity in assigning function to genes based on their biclusters, and in this way we are able to annotate 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to the detection of new concrete biological associations. In the cancer data, we are able to detect and relate finer tissue types than was previously possible. We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).

[1]  Walter L. Smith Probability and Statistics , 1959, Nature.

[2]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[4]  M. Golumbic Algorithmic graph theory and perfect graphs , 1980 .

[5]  M. Yamada,et al.  Y′-Help1, a DNA Helicase Encoded by the Yeast Subtelomeric Y′ Element, Is Induced in Survivors Defective for Telomerase , 1998, The Journal of Biological Chemistry.

[6]  Michael Ruogu Zhang,et al.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. , 1998, Molecular biology of the cell.

[7]  Dorit S. Hochbaum,et al.  Approximating Clique and Biclique Problems , 1998, J. Algorithms.

[8]  Yudong D. He,et al.  Functional Discovery via a Compendium of Expression Profiles , 2000, Cell.

[9]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[10]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[11]  D. Botstein,et al.  Genomic expression programs in the response of yeast cells to environmental changes. , 2000, Molecular biology of the cell.

[12]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[13]  G. Getz,et al.  Coupled two-way clustering analysis of gene microarray data. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[14]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[15]  D. Botstein,et al.  Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. , 2001, Molecular biology of the cell.

[16]  Ben Taskar,et al.  Rich probabilistic models for gene expression , 2001, ISMB.

[17]  Ron Shamir,et al.  Computational expansion of genetic networks , 2001, ISMB.

[18]  Roger E Bumgarner,et al.  Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. , 2001, Science.

[19]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[20]  R. Sharan,et al.  Cluster analysis and its applications to gene expression data. , 2002, Ernst Schering Research Foundation workshop.