A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome.

We develop a probabilistic system for predicting the subcellular localization of proteins and estimating the relative population of the various compartments in yeast. Our system employs a Bayesian approach, updating a protein's probability of being in a compartment based on a diverse range of 30 features. These range from specific motifs (e.g. signal sequences or the HDEL motif) to overall properties of a sequence (e.g. surface composition or isoelectric point) to whole-genome data (e.g. absolute mRNA expression levels or their fluctuations). The strength of our approach is the easy integration of many features, particularly the whole-genome expression data. We construct a training and testing set of approximately 1300 yeast proteins with an experimentally known localization by merging, filtering, and standardizing the annotation in the MIPS, Swiss-Prot and YPD databases, and we achieve 75% accuracy on individual protein predictions using this dataset. Moreover, we are able to estimate the relative protein population of the various compartments without requiring a definite localization for every protein. This approach, which is based on an analogy to formalism in quantum mechanics, gives better accuracy in determining relative compartment populations than that obtained by simply tallying the localization predictions for individual proteins (on the yeast proteins with known localization, 92% versus 74%). Our training and testing also highlight which of the 30 features are informative and which are redundant (19 being particularly useful). After developing our system, we apply it to the 4700 yeast proteins with currently unknown localization and estimate the relative population of the various compartments in the entire yeast genome. An unbiased prior is essential to this extrapolated estimate; for this, we use the MIPS localization catalogue and adapt recent results on the localization of yeast proteins obtained by Snyder and colleagues using a minitransposon system. Our final localizations for all approximately 6000 proteins in the yeast genome are available over the web at: http://bioinfo.mbb.yale.edu/genome/localize.
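The two ideas in the abstract, updating a protein's compartment probabilities feature by feature and estimating compartment populations without committing each protein to a single compartment, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the compartment list, the prior, the likelihood values, and all function names are hypothetical, and treating features as conditionally independent (a naive-Bayes-style update) is an assumption made here for simplicity. Averaging the probability vectors is one plausible reading of "without requiring a definite localization"; the abstract does not spell out the actual formalism.

```python
from collections import Counter

# Hypothetical, illustrative subset of compartments (not the paper's full list).
COMPARTMENTS = ["cytoplasm", "nucleus", "mitochondria"]


def bayes_update(prior, likelihood):
    """One Bayes step: posterior(c) is proportional to prior(c) * P(feature | c)."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


def localize(prior, feature_likelihoods):
    """Apply one update per observed feature (assumes independent features)."""
    posterior = dict(prior)
    for likelihood in feature_likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior


def population_by_tally(posteriors):
    """Assign each protein to its most probable compartment, then count."""
    counts = Counter(max(p, key=p.get) for p in posteriors)
    n = len(posteriors)
    return {c: counts.get(c, 0) / n for c in COMPARTMENTS}


def population_by_summing(posteriors):
    """Average the probability vectors; no protein gets a hard assignment."""
    n = len(posteriors)
    return {c: sum(p[c] for p in posteriors) / n for c in COMPARTMENTS}


if __name__ == "__main__":
    # Made-up prior and likelihoods, purely for illustration.
    prior = {"cytoplasm": 0.5, "nucleus": 0.3, "mitochondria": 0.2}
    has_nuclear_motif = {"cytoplasm": 0.1, "nucleus": 0.8, "mitochondria": 0.1}
    low_expression = {"cytoplasm": 0.3, "nucleus": 0.5, "mitochondria": 0.4}
    print(localize(prior, [has_nuclear_motif, low_expression]))
```

The difference between the two population estimates shows up when predictions are uncertain: two proteins that are each about 55-60% likely to be cytoplasmic both count as cytoplasmic in the tally, whereas averaging keeps the remaining nuclear probability mass in the estimate.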
