Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters

Genome sequencing has led to the discovery of tens of thousands of potential new genes. Six years after the sequencing of the well-studied yeast Saccharomyces cerevisiae and the discovery that its genome encodes ∼6,000 predicted proteins, more than 2,000 of these proteins have not yet been characterized experimentally, and determining their functions is far from a trivial task. One crucial constraint is the generation of useful hypotheses about protein function. Using a new approach to interpret microarray data, we assign likely cellular functions, with confidence values, to these uncharacterized yeast proteins. We perform extensive genome-wide validations of our predictions and offer visualization methods for exploring the large numbers of functional predictions. We identify potential new members of many existing functional categories, including 285 candidate proteins involved in transcription, processing and transport of non-coding RNA molecules. We present experimental validation confirming the involvement of several of these proteins in ribosomal RNA processing. Our methodology can be applied to a variety of genomics data types and organisms.
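As an illustration of how functions with confidence values might be assigned from overlapping expression clusters, the following is a minimal sketch and not the procedure used in this work. It assumes a generic "guilt by association" scheme: each cluster is tested for enrichment of each functional category with a hypergeometric tail probability, and uncharacterized genes in a significantly enriched cluster inherit that category, with the P value serving as the confidence value. The function names, the threshold `alpha`, and the toy gene and category names are all hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N annotated genes, K of which are in the category."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def predict_functions(clusters, annotations, alpha=0.01):
    """Assign categories to unannotated genes in enriched clusters.

    clusters:    list of sets of gene names (clusters may overlap)
    annotations: dict mapping each characterized gene to a set of categories
    alpha:       hypothetical significance threshold (an assumption)
    Returns {gene: {category: best P value over all clusters}}.
    """
    annotated = set(annotations)
    N = len(annotated)
    categories = set().union(*annotations.values())
    predictions = {}
    for cluster in clusters:
        known = cluster & annotated
        unknown = cluster - annotated
        if not known or not unknown:
            continue
        for cat in categories:
            K = sum(1 for g in annotated if cat in annotations[g])
            k = sum(1 for g in known if cat in annotations[g])
            if k == 0:
                continue
            p = hypergeom_tail(k, len(known), K, N)
            if p <= alpha:
                for gene in unknown:
                    per_gene = predictions.setdefault(gene, {})
                    # A gene may sit in several overlapping clusters; keep the best P value.
                    per_gene[cat] = min(p, per_gene.get(cat, 1.0))
    return predictions


# Toy data (hypothetical): g1..g10 are characterized, u1 and u2 are not.
annotations = {f"g{i}": {"rRNA processing"} for i in range(1, 5)}
annotations.update({f"g{i}": {"other"} for i in range(5, 11)})
clusters = [
    {"g1", "g2", "g3", "g4", "u1"},  # dominated by rRNA processing genes
    {"g4", "g5", "u1", "u2"},        # overlaps the first cluster, not enriched
]
predictions = predict_functions(clusters, annotations)
print(predictions)
```

In this toy example only `u1` receives a prediction, because only the first cluster is significantly enriched; membership in the second, overlapping cluster neither adds nor removes a prediction. A real analysis would also need to correct for the many cluster and category combinations tested.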
