Identification of Bacteriophage Virion Proteins Using Multinomial Naïve Bayes with g-Gap Feature Tree

Bacteriophages, which are tremendously important to the ecology and evolution of bacteria, play a key role in the development of genetic engineering. Bacteriophage virion proteins are essential components of the infectious viral particle and are responsible for several biological functions. Correctly identifying bacteriophage virion proteins is therefore of great importance for understanding both life at the molecular level and genetic evolution. However, few computational methods are available for this task. In this paper, we propose a new method that predicts bacteriophage virion proteins using a Multinomial Naïve Bayes classification model based on discrete features generated from the g-gap feature tree. The proposed model reaches an accuracy of 98.37% with an MCC of 96.27% in 10-fold cross-validation. This result suggests that the proposed method can be a useful approach for identifying bacteriophage virion proteins from sequence information alone. For the convenience of experimental scientists, a web server (PhagePred) that implements the proposed predictor is freely accessible on the Internet.
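To make the idea concrete, the following is a minimal sketch of how discrete g-gap features can feed a Multinomial Naïve Bayes classifier. It is not the authors' implementation: the abstract does not define how the g-gap feature tree is built, so the sketch assumes plain g-gap dipeptide counts (pairs of residues separated by g positions) as the discrete features. The names `g_gap_counts`, `feature_vector`, and `ToyMultinomialNB`, the smoothing parameter `alpha`, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_counts(seq, g):
    """Count residue pairs separated by exactly g positions (g=0: adjacent)."""
    counts = Counter()
    for i in range(len(seq) - g - 1):
        a, b = seq[i], seq[i + g + 1]
        if a in AMINO_ACIDS and b in AMINO_ACIDS:  # skip non-standard symbols
            counts[a + b] += 1
    return counts


def feature_vector(seq, g):
    """Return the 400 g-gap dipeptide counts in a fixed order."""
    counts = g_gap_counts(seq, g)
    return [counts[a + b] for a, b in product(AMINO_ACIDS, repeat=2)]


class ToyMultinomialNB:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed smoothing constant, not from the paper

    def fit(self, X, y):
        n_features = len(X[0])
        self.classes = sorted(set(y))
        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) for col in zip(*rows)]  # per-feature counts
            denom = sum(totals) + self.alpha * n_features
            self.log_lik[c] = [math.log((t + self.alpha) / denom) for t in totals]
        return self

    def predict(self, X):
        def score(x, c):
            # log P(c) + sum over features of count * log P(feature | c)
            return self.log_prior[c] + sum(n * w for n, w in zip(x, self.log_lik[c]))
        return [max(self.classes, key=lambda c: score(x, c)) for x in X]


# Toy, made-up sequences purely for illustration (1 = virion, 0 = non-virion).
train_seqs = ["AAAAAA", "AAAAAC", "GGGGGG", "GGGGGC"]
train_labels = [1, 1, 0, 0]
model = ToyMultinomialNB().fit([feature_vector(s, 0) for s in train_seqs], train_labels)
predictions = model.predict([feature_vector("AAAA", 0), feature_vector("GGGG", 0)])
```

Count-valued features are a natural fit for the multinomial model, which is why the sketch keeps raw counts rather than normalised frequencies; a real pipeline would also need feature selection across several values of g and cross-validated evaluation.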
