PSLDoc: Protein subcellular localization prediction based on gapped‐dipeptides and probabilistic latent semantic analysis

Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for sequence‐based PSL prediction in Gram‐negative bacteria have been proposed in recent years. We present PSLDoc, a method based on gapped‐dipeptides and probabilistic latent semantic analysis (PLSA), to address this problem. A protein is treated as a term string composed of gapped‐dipeptides, which are defined as any two residues separated by one or more positions. Each gapped‐dipeptide is weighted according to a position‐specific scoring matrix, which incorporates sequence evolutionary information. PLSA is then applied for feature reduction, and the reduced vectors are input to five one‐versus‐rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. A strong correlation between sequence homology and subcellular localization has been reported (Nair and Rost, Protein Sci 2002;11:2836–2847; Yu et al., Proteins 2006;64:643–651). To evaluate the performance of PSLDoc properly, target proteins are classified into low‐ or high‐homology data sets. PSLDoc's overall accuracy reaches 86.84% on the low‐homology data set and 98.21% on the high‐homology data set, which compares favorably with CELLO II (Yu et al., Proteins 2006;64:643–651). In addition, we set a confidence threshold to achieve high precision at specified levels of recall. With the confidence threshold set at 0.7, PSLDoc achieves a precision of 97.89%, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617–623). Our approach demonstrates that this feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy.
Moreover, because the representation is general, the method could be extended to eukaryotic proteomes in the future. The PSLDoc web server is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/. Proteins 2008. © 2008 Wiley‐Liss, Inc.
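
As a rough illustration of two ideas in the abstract, the sketch below counts gapped‐dipeptide terms in a sequence and applies a confidence threshold to per‐site probabilities. It is a minimal sketch, not the PSLDoc implementation: the function names, the `max_distance` parameter, the use of raw counts instead of the PSSM‐based weights, and the example site names and probabilities are all assumptions made for illustration. The PLSA reduction and the support vector machine classifiers are omitted.

```python
from collections import Counter

# The 20 standard amino acids; pairs containing any other symbol are skipped.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def gapped_dipeptide_counts(sequence, max_distance):
    """Count gapped-dipeptide terms in a protein sequence.

    A term is a tuple (x, d, y): residue x at position i and residue y at
    position i + d, with 1 <= d <= max_distance. Treating d as the index
    difference (adjacent residues have d = 1) is an assumption of this sketch.
    Raw counts are used here; the abstract describes PSSM-based weights.
    """
    counts = Counter()
    for i, x in enumerate(sequence):
        if x not in AMINO_ACIDS:
            continue
        for d in range(1, max_distance + 1):
            j = i + d
            if j >= len(sequence):
                break
            y = sequence[j]
            if y in AMINO_ACIDS:
                counts[(x, d, y)] += 1
    return counts


def predict_with_threshold(site_probabilities, threshold):
    """Return the most probable site, or None if its probability is below threshold."""
    site, probability = max(site_probabilities.items(), key=lambda item: item[1])
    return site if probability >= threshold else None


if __name__ == "__main__":
    # Toy sequence, chosen only to show the term format.
    print(gapped_dipeptide_counts("MKKA", max_distance=2))

    # Made-up probabilities for five illustrative localization sites.
    probabilities = {
        "cytoplasmic": 0.05,
        "inner membrane": 0.10,
        "periplasmic": 0.72,
        "outer membrane": 0.08,
        "extracellular": 0.05,
    }
    print(predict_with_threshold(probabilities, threshold=0.7))
```

Returning `None` below the threshold corresponds to withholding a prediction, which is how a threshold can raise precision at the cost of recall.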

[1]  M. Kanehisa,et al.  Expert system for predicting protein localization sites in gram‐negative bacteria , 1991, Proteins.

[2]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[3]  Martin B Ulmschneider,et al.  Properties of integral membrane protein structures: Derivation of an implicit membrane potential , 2005, Proteins.

[4]  Jacob Cohen Statistical Power Analysis for the Behavioral Sciences , 1969, The SAGE Encyclopedia of Research Design.

[5]  Alexei A. Efros,et al.  Discovering object categories in image collections , 2005 .

[6]  Chih-Jen Lin,et al.  Probability Estimates for Multi-class Classification by Pairwise Coupling , 2003, J. Mach. Learn. Res..

[7]  Doron Gerber,et al.  Specificity in Transmembrane Helix-Helix Interactions Mediated by Aromatic Residues* , 2007, Journal of Biological Chemistry.

[8]  K. Chou,et al.  Protein subcellular location prediction. , 1999, Protein engineering.

[9]  Alex Alves Freitas,et al.  Comparing Several Approaches for Hierarchical Classification of Proteins with Decision Trees , 2007, BSB.

[10]  Ming-Tat Ko,et al.  Amino acid coupling patterns in thermophilic proteins , 2005, Proteins.

[11]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[12]  Judith Klein-Seetharaman,et al.  Protein Classification Based on Text Document Classification Techniques , 2005, Proteins.

[13]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[14]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[15]  Minoru Kanehisa,et al.  Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs , 2003, Bioinform..

[16]  Burkhard Rost,et al.  Sequence conserved for subcellular localization , 2002, Protein science : a publication of the Protein Society.

[17]  Zhiyong Lu,et al.  Predicting subcellular localization of proteins using machine-learned classifiers , 2004, Bioinform..

[18]  Jenn-Kang Hwang,et al.  Predicting subcellular localization of proteins for Gram‐negative bacteria by support vector machines based on n‐peptide compositions , 2004, Protein science : a publication of the Protein Society.

[19]  Martin Ester,et al.  PSORTb v.2.0: Expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis , 2004 .

[20]  Gajendra P. S. Raghava,et al.  PSLpred: prediction of subcellular localization of bacterial proteins , 2005, Bioinform..

[21]  Bill C White,et al.  Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases , 2003, BMC Bioinformatics.

[22]  G J Barton,et al.  Evaluation and improvement of multiple sequence methods for protein secondary structure prediction , 1999, Proteins.

[23]  Eugene W. Myers,et al.  Optimal alignments in linear space , 1988, Comput. Appl. Biosci..

[24]  Raúl E. Valdés-Pérez,et al.  Concise, intelligible, and approximate profiling of multiple classes , 2000, Int. J. Hum. Comput. Stud..

[25]  Jenn-Kang Hwang,et al.  Prediction of protein subcellular localization , 2006, Proteins.

[26]  J. Tommassen,et al.  Assembly Factor Omp85 Recognizes Its Outer Membrane Protein Substrates by a Species-Specific C-Terminal Motif , 2006, PLoS biology.

[27]  K. Nakai,et al.  PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. , 1999, Trends in biochemical sciences.

[28]  Wen-Lian Hsu,et al.  HYPROSP II-A knowledge-based hybrid method for protein secondary structure prediction based on local prediction confidence , 2005, Bioinform..

[29]  Wen-Lian Hsu,et al.  HYPROSP: a hybrid protein secondary structure prediction algorithm--a knowledge-based approach. , 2004, Nucleic acids research.

[30]  B. Rost,et al.  Mimicking cellular sorting improves prediction of subcellular localization. , 2005, Journal of molecular biology.

[31]  Thomas Hofmann Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2001 .

[32]  B. Rost,et al.  Prediction of protein secondary structure at better than 70% accuracy. , 1993, Journal of molecular biology.

[33]  Brian R. King,et al.  ngLOC: an n-gram-based Bayesian method for estimating the subcellular proteomes of eukaryotes , 2007, Genome biology.

[34]  B. Rost,et al.  Better prediction of sub‐cellular localization by combining evolutionary and structural information , 2003, Proteins.

[35]  T. Hubbard,et al.  Using neural networks for prediction of the subcellular location of proteins. , 1998, Nucleic acids research.

[36]  Zhirong Sun,et al.  Support vector machine approach for protein subcellular localization prediction , 2001, Bioinform..

[37]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[38]  P. Aloy,et al.  Relation between amino acid composition and cellular location of proteins. , 1997, Journal of molecular biology.

[39]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[40]  Jianhui Luo,et al.  Experiments on Supervised Learning Algorithms for Text Categorization , 2005, 2005 IEEE Aerospace Conference.

[41]  Ke Wang,et al.  PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria , 2003, Nucleic Acids Res..

[42]  Liam J. McGuffin,et al.  The PSIPRED protein structure prediction server , 2000, Bioinform..

[43]  G. Heijne,et al.  Recognition of transmembrane helices by the endoplasmic reticulum translocon , 2005, Nature.

[44]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[45]  I. Booth,et al.  Regulation of cytoplasmic pH in bacteria. , 1985, Microbiological reviews.

[46]  M. Bhasin,et al.  Support Vector Machine-based Method for Subcellular Localization of Human Proteins Using Amino Acid Compositions, Their Order, and Similarity Search* , 2005, Journal of Biological Chemistry.

[47]  Christophe G. Lambert,et al.  PSORTdb: a protein subcellular localization database for bacteria , 2004, Nucleic Acids Res..

[48]  Ankush Gupta,et al.  Latent Semantic Indexing based Intelligent Information Retrieval System for Digital Libraries , 2006, J. Comput. Inf. Technol..

[49]  Wing-Kin Sung,et al.  Protein subcellular localization prediction for Gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines , 2005, BMC Bioinformatics.

[50]  S. Hua,et al.  A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. , 2001, Journal of molecular biology.

[51]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.