A Machine Learning Based Method for the Prediction of Secretory Proteins Using Amino Acid Composition, Their Order and Similarity-Search

Most prediction methods for secretory proteins require a correct N-terminal end of the preprotein for correct classification. Because large-scale genome sequencing projects sometimes assign the 5'-end of genes incorrectly, many predicted proteins lack the correct N-terminus, which leads to incorrect predictions. In this study, we make a systematic attempt to predict secretory proteins irrespective of the presence or absence of N-terminal signal peptides (classical and non-classical secreted proteins, respectively), using two machine-learning techniques: artificial neural networks (ANN) and support vector machines (SVM). We trained and tested our methods on a dataset of 3321 secretory and 3654 non-secretory mammalian proteins using five-fold cross-validation. First, ANN-based modules using 33 physico-chemical properties, amino acid composition and dipeptide composition achieved accuracies of 73.1%, 76.1% and 77.1%, respectively. Similarly, SVM-based modules using the same three feature sets achieved accuracies of 77.4%, 79.4% and 79.9%, respectively. In addition, BLAST and PSI-BLAST modules that predict secretory proteins by similarity search achieved 23.4% and 26.9% accuracy, respectively. Finally, we developed a hybrid approach that integrates the amino acid and dipeptide composition based SVM modules with the PSI-BLAST module, which increased the accuracy to 83.2%, significantly better than any individual module. The hybrid module also achieved a high sensitivity of 60.4% with only 5% false positive predictions. A web server, SRTpred, has been developed based on this study for predicting classical and non-classical secreted proteins from the whole sequence of mammalian proteins; it is available at http://www.imtech.res.in/raghava/srtpred/.
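
To make the composition-based features concrete, the following is a minimal Python sketch of how amino acid composition (20 values) and dipeptide composition (400 values) could be computed from a protein sequence. It is an illustration under stated assumptions, not the code behind SRTpred: the function names are hypothetical, values are expressed as percentages (a common convention that the abstract does not specify), and residues outside the 20 standard amino acids are simply skipped.

```python
from itertools import product

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, e.g. "AA", "AC", ..., "YY".
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return 20 percentages, one per standard amino acid (hypothetical helper)."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    total = len(residues)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [100.0 * residues.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return 400 percentages, one per ordered pair of adjacent residues.

    Assumption: a pair is counted only when both residues are standard
    amino acids; other pairs are ignored rather than bridged.
    """
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [100.0 * counts[dp] / total for dp in DIPEPTIDES]


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # made-up sequence, for illustration only
    aac = amino_acid_composition(example)
    dpc = dipeptide_composition(example)
    print(len(aac), len(dpc))
```

Unlike amino acid composition, the dipeptide vector retains information about the local order of residues, which is the sense in which the title refers to amino acid "order". Either vector can be passed as a fixed-length input to a classifier such as an SVM or a neural network, regardless of sequence length.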
