An Ensemble Method to Distinguish Bacteriophage Virion from Non-Virion Proteins Based on Protein Sequence Characteristics

Bacteriophage virion proteins and non-virion proteins have distinct functions in biological processes, such as specificity determination for host bacteria, bacteriophage replication and transcription. Accurate identification of virion proteins from bacteriophage protein sequences is important for understanding the complex virulence mechanism in host bacteria and the influence of bacteriophages on the development of antibacterial drugs. In this study, an ensemble method for predicting bacteriophage virion proteins from protein sequences is proposed. It uses hybrid feature spaces incorporating CTD (composition, transition and distribution), bi-profile Bayes, PseAAC (pseudo-amino acid composition) and PSSM (position-specific scoring matrix). In 10-fold cross-validation on the training dataset, the method achieves a sensitivity of 0.870, a specificity of 0.830, an accuracy of 0.850 and a Matthews correlation coefficient (MCC) of 0.701. To assess prediction performance objectively, the method is also evaluated on an independent testing dataset, where it outperforms previous studies with a sensitivity of 0.853, a specificity of 0.815, an accuracy of 0.831 and an MCC of 0.662. These results suggest that the proposed method is a promising candidate for bacteriophage virion protein prediction, and that it may serve as a useful tool for finding novel antibacterial drugs and for understanding the relationship between bacteriophages and host bacteria. For the convenience of experimental scientists, a user-friendly, publicly accessible web server for the proposed ensemble method has been established.
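The four figures reported above are standard functions of a binary confusion matrix. The sketch below shows how they are conventionally computed. The function name `binary_metrics` and the example counts (100 virion and 100 non-virion proteins) are hypothetical, chosen only for illustration; they are not the dataset sizes or the implementation used in this study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts.

    tp/fn: virion proteins predicted correctly / incorrectly.
    tn/fp: non-virion proteins predicted correctly / incorrectly.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    # MCC denominator is zero when any row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical, balanced example counts (not taken from the paper).
example = binary_metrics(tp=87, fn=13, tn=83, fp=17)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four cells of the confusion matrix, it remains informative when the two classes differ in size, which is one reason it is commonly reported alongside accuracy.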
