PVP-SVM: Sequence-Based Prediction of Phage Virion Proteins Using a Support Vector Machine

Accurately identifying bacteriophage virion proteins among uncharacterized sequences is important for understanding the interactions between a phage and its host bacteria and, in turn, for developing new antibacterial drugs. However, identifying such proteins experimentally is expensive and often time-consuming; an efficient computational method for predicting phage virion proteins (PVPs) prior to in vitro experimentation is therefore needed. Here, we describe a support vector machine (SVM)-based PVP predictor, called PVP-SVM, which was trained with 136 optimal features. A feature selection protocol was employed to identify these optimal features from a large set comprising amino acid composition, dipeptide composition, atomic composition, physicochemical properties, and chain-transition-distribution. PVP-SVM achieved an accuracy of 0.870 in leave-one-out cross-validation, which was 6% higher than that of control SVM predictors trained with all features, indicating the effectiveness of the feature selection method. Furthermore, when objectively evaluated on an independent dataset, PVP-SVM outperformed the currently available method, PVPred, and two other machine-learning methods developed in this study. For the convenience of the scientific community, a user-friendly and publicly accessible web server has been established at www.thegleelab.org/PVP-SVM/PVP-SVM.html.
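As an illustration of the two simplest feature groups named above, the following minimal sketch computes amino acid composition (20 values) and dipeptide composition (400 values) for one protein sequence. It is not the PVP-SVM implementation: the function names, the restriction to the 20 standard residues, and the choice to skip dipeptides containing non-standard residues are assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids; non-standard residues are ignored (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order so feature positions are stable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return [sequence.count(residue) / len(sequence) for residue in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the fraction of each adjacent residue pair in the sequence."""
    sequence = sequence.upper()
    n_pairs = len(sequence) - 1
    if n_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(n_pairs):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs that contain a non-standard residue
            counts[pair] += 1
    return [counts[dipeptide] / n_pairs for dipeptide in DIPEPTIDES]


def encode(sequence):
    """Concatenate both compositions into one 420-dimensional feature vector."""
    return amino_acid_composition(sequence) + dipeptide_composition(sequence)
```

Vectors of this kind, extended with the remaining feature groups, would be the input to a feature selection step and then to an SVM classifier; those later stages are not shown here.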

[58]  K. Chou,et al.  iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. , 2013, Analytical biochemistry.