Better prediction of sub‐cellular localization by combining evolutionary and structural information

The native sub‐cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code‐like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization, and all general prediction methods exploit this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four‐state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra‐cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS‐PROT sequences failed grossly when applied to sequences of proteins of known structure taken from the PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB‐based system along with homology‐based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics. Proteins 2003;53:000–000. © 2003 Wiley‐Liss, Inc.
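
As an illustration of the two quantities the abstract relies on, the following minimal sketch computes an amino acid composition vector and an overall four‐state accuracy. It is not the authors' implementation: the function names, the 20‐letter alphabet ordering, and the class labels (the abstract names extra‐cellular, nuclear, and mitochondrial; "cytoplasmic" as the fourth state is an assumption here) are chosen only for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Non-standard letters (e.g. X, B, Z) are ignored. This is the simplest
    composition feature; the abstract's system adds evolutionary and
    structural information on top of such features.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def four_state_accuracy(predicted, observed):
    """Fraction of proteins whose predicted compartment matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one protein is required")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)


if __name__ == "__main__":
    # Toy sequence and toy labels, invented purely for this example.
    vector = composition("MKKLLAAEE")
    print(dict(zip(AMINO_ACIDS, vector)))

    predicted = ["nuclear", "cytoplasmic", "extra-cellular", "mitochondrial"]
    observed = ["nuclear", "cytoplasmic", "mitochondrial", "mitochondrial"]
    print(four_state_accuracy(predicted, observed))
```

A composition vector of this kind could serve as the input to a classifier; the overall accuracy is simply the share of correctly assigned proteins over all four classes.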
