Sequence conserved for subcellular localization

The more proteins diverge in sequence, the more difficult it becomes to infer similarities in protein function and structure from sequence. The precise thresholds used in automated genome annotation depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) the subcellular compartment is generally more conserved than might have been expected, given that short sequence motifs such as nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure of sequence similarity that accounts for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localization. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine-tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and to five entirely sequenced eukaryotes.
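
The abstract describes the HSSP distance only as a measure combining pairwise sequence identity and alignment length. A minimal sketch of how such a distance might be computed follows. It assumes the length-dependent identity curve commonly attributed to the twilight-zone analysis of Rost (1999); the formula, the function names, and the `min_distance` cutoff are assumptions for illustration and are not taken from this abstract, which does not state the thresholds it tuned.

```python
import math


def hssp_threshold(alignment_length: int) -> float:
    """Percentage identity on the HSSP curve for a given alignment length.

    Assumed form (not given in the abstract):
      100                                   for L <= 11
      480 * L ** (-0.32 * (1 + exp(-L/1000))) for 11 < L <= 450
      19.5                                  for L > 450
    """
    if alignment_length <= 11:
        return 100.0
    if alignment_length > 450:
        return 19.5
    exponent = -0.32 * (1.0 + math.exp(-alignment_length / 1000.0))
    return 480.0 * alignment_length ** exponent


def hssp_distance(percent_identity: float, alignment_length: int) -> float:
    """Distance of an alignment from the curve, in percentage points.

    Positive values lie above the curve (more similar than the cutoff),
    negative values lie below it.
    """
    return percent_identity - hssp_threshold(alignment_length)


def may_transfer_localization(percent_identity: float,
                              alignment_length: int,
                              min_distance: float = 0.0) -> bool:
    """Hypothetical decision rule: transfer the annotation only if the
    alignment lies at least `min_distance` points above the curve.
    The default of 0.0 is a placeholder, not a value from the paper."""
    return hssp_distance(percent_identity, alignment_length) >= min_distance


if __name__ == "__main__":
    # A 100-residue alignment at 40% identity versus a short 30-residue one.
    for pide, length in [(40.0, 100), (40.0, 30)]:
        print(length, round(hssp_distance(pide, length), 1),
              may_transfer_localization(pide, length))
```

The point of the length term is that the same percentage identity counts for less in a short alignment: under the assumed curve, 40% identity over 100 residues lies above the cutoff, while 40% over 30 residues falls below it.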
