Review and comparative analysis of machine learning-based phage virion protein identification methods.

Phage virion protein (PVP) identification plays a key role in elucidating the relationships between phages and their hosts. Moreover, PVP identification can facilitate the design of related biochemical entities. Recently, several machine learning approaches have emerged for this purpose and have shown promising potential. In this study, the proposed PVP identifiers are systematically reviewed, and the related algorithms and tools are comprehensively analyzed. We summarize the common framework of these PVP identifiers and construct our own novel identifiers based on this framework. Furthermore, we compare the performance of all PVP identifiers on a training dataset and an independent dataset. The comparison highlights the pros and cons of these identifiers and indicates that g-gap DPC (dipeptide composition) features are capable of representing the characteristics of PVPs. Moreover, SVM (support vector machine) proves more effective than the other classifiers compared at distinguishing PVPs from non-PVPs.
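As an illustration of the g-gap DPC features mentioned above, the following minimal sketch computes a 400-dimensional g-gap dipeptide composition vector for one protein sequence. It assumes the common definition in which a g-gap dipeptide is a pair of residues separated by g intervening positions, with counts normalized by the number of valid pairs. The function name `g_gap_dpc`, the residue ordering, and the handling of non-standard residues are assumptions made for this example and are not taken from the reviewed tools.

```python
from itertools import product

# The 20 standard amino acids, in an arbitrary but fixed order (assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dpc(sequence, g=0):
    """Return the g-gap dipeptide composition of a protein sequence.

    A g-gap dipeptide is the pair (residue i, residue i + g + 1).
    With g=0 this reduces to the ordinary adjacent dipeptide composition.
    Pairs containing non-standard residues are skipped (assumption).
    """
    seq = sequence.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short (or no valid pairs): return an all-zero vector.
        return [0.0] * len(pairs)
    # Normalize counts to frequencies so sequences of different length are comparable.
    return [counts[p] / total for p in pairs]


if __name__ == "__main__":
    features = g_gap_dpc("MKTAYIAKQRQISFVKSHFSRQ", g=1)
    print(len(features))
```

Vectors of this kind, one per protein, could then be fed to a classifier such as an SVM; the choice of g and any subsequent feature selection would need to be tuned on the training data.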
