Paper information - A Machine Learning Based Method for the Prediction of Secretory Proteins Using Amino Acid Composition, Their Order and Similarity-Search

Most prediction methods for secretory proteins require the correct N-terminal end of the preprotein for correct classification. Because large-scale genome sequencing projects sometimes assign the 5'-end of genes incorrectly, many protein sequences lack the correct N-terminus, which leads to incorrect predictions. This study makes a systematic attempt to predict secretory proteins irrespective of the presence or absence of N-terminal signal peptides (classical and non-classical secreted proteins, respectively), using two machine-learning techniques: artificial neural networks (ANN) and support vector machines (SVM). We trained and tested our methods on a dataset of 3321 secretory and 3654 non-secretory mammalian proteins using five-fold cross-validation. First, ANN-based modules using 33 physico-chemical properties, amino acid composition, and dipeptide composition achieved accuracies of 73.1%, 76.1% and 77.1%, respectively. Similarly, SVM-based modules using 33 physico-chemical properties, amino acid composition, and dipeptide composition achieved accuracies of 77.4%, 79.4% and 79.9%, respectively. In addition, BLAST and PSI-BLAST modules designed to predict secretory proteins by similarity search achieved 23.4% and 26.9% accuracy, respectively. Finally, we developed a hybrid approach that integrates the amino acid composition and dipeptide composition based SVM modules with the PSI-BLAST module; it increased the accuracy to 83.2%, which is significantly better than that of the individual modules. The hybrid module also achieved a sensitivity of 60.4% with only 5% false positive predictions. A web server, SRTpred, has been developed from this study for predicting classical and non-classical secreted proteins from the whole sequence of mammalian proteins; it is available at http://www.imtech.res.in/raghava/srtpred/.
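The two composition-based feature sets named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the normalisation used here (residue counts divided by sequence length, and dipeptide counts divided by the number of adjacent residue pairs) is an assumption based on the commonly used definitions, not a statement of the authors' exact implementation. The function names and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acids, in a fixed order so feature vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard amino acids ("AA", "AC", ..., "YY").
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return a 20-element list: the fraction of each standard residue.

    Assumed definition: count of residue / number of standard residues.
    Non-standard symbols (e.g. 'X') are ignored.
    """
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return a 400-element list: the fraction of each adjacent residue pair.

    Assumed definition: count of dipeptide / number of valid adjacent pairs.
    Pairs containing a non-standard symbol are skipped.
    """
    seq = sequence.upper()
    pairs = [
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in AMINO_ACIDS and seq[i + 1] in AMINO_ACIDS
    ]
    if not pairs:
        raise ValueError("sequence contains no valid dipeptides")
    counts = Counter(pairs)
    return [counts[d] / len(pairs) for d in DIPEPTIDES]


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQR"  # hypothetical sequence, for illustration only
    aac = amino_acid_composition(toy_sequence)
    dpc = dipeptide_composition(toy_sequence)
    print(len(aac), len(dpc))
```

Unlike plain amino acid composition, the dipeptide vector retains some information about local residue order, which matches the "their order" part of the paper title. Either vector could then be passed to a classifier such as an SVM; the choice of kernel and parameters is not stated in the abstract and is left out of this sketch.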

Gajendra P. S. Raghava | Aarti Garg
