Prediction of protein subcellular localization

Because the protein's function is usually related to its subcellular localization, the ability to predict subcellular localization directly from protein sequences will be useful for inferring protein functions. Recent years have seen a surging interest in the development of novel computational tools to predict subcellular localization. At present, these approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. A number of authors have noticed that sequence similarity is useful in predicting subcellular localization. For example, Nair and Rost (Protein Sci 2002;11:2836–2847) have carried out extensive analysis of the relation between sequence similarity and identity in subcellular localization, and have found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used for the prediction accuracy assessment contain highly homologous sequences—some data sets comprising sequences up to 80–90% sequence identity. Using these benchmark test data will surely lead to overestimation of the performance of the methods considered. Here, we develop an approach based on a two‐level support vector machine (SVM) system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vectors derived from sequences; the second level SVM classifier functions as the jury machine to generate the probability distribution of decisions for possible localizations. We compare our approach with a global sequence alignment approach and other existing approaches for two benchmark data sets—one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all‐against‐all sequence alignment for several data sets to investigate the relationship between sequence homology and subcellular localization. Our results, which are consistent with previous studies, indicate that the homology search approach performs well down to 30% sequence identity, although its performance deteriorates considerably for sequences sharing lower sequence identity. A data set of high homology levels will undoubtedly lead to biased assessment of the performances of the predictive approaches—especially those relying on homology search or sequence annotations. Our two‐level classification system based on SVM does not rely on homology search; therefore, its performance remains relatively unaffected by sequence homology. When compared with other approaches, our approach performed significantly better. Furthermore, we also develop a practical hybrid method, which combines the two‐level SVM classifier and the homology search method, as a general tool for the sequence annotation of subcellular localization. Proteins 2006. © 2006 Wiley‐Liss, Inc.

[1]  Jie Liang,et al.  Prediction of buried helices in multispan alpha helical membrane proteins , 2006, Proteins.

[2]  Yang Dai,et al.  An SVM-based system for predicting protein subnuclear localizations , 2005, BMC Bioinformatics.

[3]  Jenn-Kang Hwang,et al.  Prediction of disulfide connectivity from protein sequences , 2005, Proteins.

[4]  Gajendra P. S. Raghava,et al.  PSLpred: prediction of subcellular localization of bacterial proteins , 2005, Bioinform..

[5]  M. Bhasin,et al.  Support Vector Machine-based Method for Subcellular Localization of Human Proteins Using Amino Acid Compositions, Their Order, and Similarity Search* , 2005, Journal of Biological Chemistry.

[6]  Martin Ester,et al.  Sequence analysis PSORTb v . 2 . 0 : Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis , 2004 .

[7]  Kuo-Chen Chou,et al.  Using GO-PseAA predictor to predict enzyme sub-class. , 2004, Biochemical and biophysical research communications.

[8]  Michelle S. Scott,et al.  Predicting subcellular localization via protein motif co-occurrence. , 2004, Genome research.

[9]  Jenn-Kang Hwang,et al.  Prediction of the bonding states of cysteines Using the support vector machines based on multiple feature vectors and cysteine state sequences , 2004, Proteins.

[10]  Jenn-Kang Hwang,et al.  Predicting subcellular localization of proteins for Gram‐negative bacteria by support vector machines based on n‐peptide compositions , 2004, Protein science : a publication of the Protein Society.

[11]  Zhiyong Lu,et al.  Predicting subcellular localization of proteins using machine-learned classifiers , 2004, Bioinform..

[12]  B. Rost,et al.  Better prediction of sub‐cellular localization by combining evolutionary and structural information , 2003, Proteins.

[13]  Minoru Kanehisa,et al.  Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs , 2003, Bioinform..

[14]  Ke Wang,et al.  PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria , 2003, Nucleic Acids Res..

[15]  Kuo-Chen Chou,et al.  Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. , 2003, Biochemical and biophysical research communications.

[16]  S. Sathiya Keerthi,et al.  Evaluation of simple performance measures for tuning SVM hyperparameters , 2003, Neurocomputing.

[17]  Chih-Jen Lin,et al.  Fine‐grained protein fold assignment by support vector machines using generalized npeptide coding schemes and jury voting from multiple‐parameter sets , 2003, Proteins.

[18]  Maria Jesus Martin,et al.  The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003 , 2003, Nucleic Acids Res..

[19]  Burkhard Rost,et al.  Sequence conserved for subcellular localization , 2002, Protein science : a publication of the Protein Society.

[20]  K. Chou,et al.  Using Functional Domain Composition and Support Vector Machines for Prediction of Protein Subcellular Location* , 2002, The Journal of Biological Chemistry.

[21]  B. Rost Enzyme function less conserved than anticipated. , 2002, Journal of molecular biology.

[22]  P Tufféry,et al.  Predicting the disulfide bonding state of cysteines using protein descriptors , 2002, Proteins.

[23]  István Simon,et al.  The HMMTOP transmembrane topology prediction server , 2001, Bioinform..

[24]  Zhirong Sun,et al.  Support vector machine approach for protein subcellular localization prediction , 2001, Bioinform..

[25]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[26]  S. Hua,et al.  A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. , 2001, Journal of molecular biology.

[27]  David P. Kreil,et al.  Identification of thermophilic species by the amino acid compositions deduced from their genomes. , 2001, Nucleic acids research.

[28]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..

[29]  Chih-Jen Lin,et al.  Formulations of Support Vector Machines: A Note from an Optimization Point of View , 2001, Neural Computation.

[30]  A. Krogh,et al.  Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. , 2001, Journal of molecular biology.

[31]  K. Chou,et al.  Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. , 2000, Biochemical and biophysical research communications.

[32]  S. Brunak,et al.  Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. , 2000, Journal of molecular biology.

[33]  Pierre Baldi,et al.  Assessing the accuracy of prediction algorithms for classification: an overview , 2000, Bioinform..

[34]  K. Nakai Protein sorting signals and prediction of subcellular localization. , 2000, Advances in protein chemistry.

[35]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[36]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000 , 2000, Nucleic Acids Res..

[37]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[38]  I. Muchnik,et al.  Recognition of a protein fold in the context of the SCOP classification , 1999 .

[39]  I. Muchnik,et al.  Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. , 1999, Proteins.

[40]  Zheng Yuan Prediction of protein subcellular locations using Markov chain models , 1999, FEBS letters.

[41]  K. Chou,et al.  Protein subcellular location prediction. , 1999, Protein engineering.

[42]  G. Heijne,et al.  ChloroP, a neural network‐based method for predicting chloroplast transit peptides and their cleavage sites , 1999, Protein science : a publication of the Protein Society.

[43]  T. Hubbard,et al.  Using neural networks for prediction of the subcellular location of proteins. , 1998, Nucleic acids research.

[44]  B. Rost,et al.  Adaptation of protein surfaces to subcellular location. , 1998, Journal of molecular biology.

[45]  P. Aloy,et al.  Relation between amino acid composition and cellular location of proteins. , 1997, Journal of molecular biology.

[46]  S. Brunak,et al.  SHORT COMMUNICATION Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites , 1997 .

[47]  Rolf Apweiler,et al.  The SWISS-PROT protein sequence data bank and its supplement TrEMBL , 1997, Nucleic Acids Res..

[48]  Cathy H. Wu,et al.  Motif identification neural design for rapid and sensitive protein family search , 1996, Comput. Appl. Biosci..

[49]  S. Brunak,et al.  Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site , 1996, Proteins.

[50]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[51]  S H Kim,et al.  Prediction of protein folding class from amino acid composition , 1993, Proteins.

[52]  M. Kanehisa,et al.  A knowledge base for predicting protein localization sites in eukaryotic cells , 1992, Genomics.

[53]  Cathy H. Wu,et al.  Protein classification artificial neural system , 1992, Protein science : a publication of the Protein Society.

[54]  M. Kanehisa,et al.  Expert system for predicting protein localization sites in gram‐negative bacteria , 1991, Proteins.

[55]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..

[56]  K Nishikawa,et al.  The folding type of a protein is relevant to the amino acid composition. , 1986, Journal of biochemistry.

[57]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[58]  N. L. Johnson,et al.  Multivariate Analysis , 1958, Nature.