DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy

Virulence factors (VFs) enable pathogens to infect their hosts. A wealth of individual, disease-focused studies has identified a wide variety of VFs, and the growing mass of bacterial genome sequence data provides an opportunity for computational methods aimed at predicting VFs. Despite their advantages and performance improvements, existing methods have two main limitations. First, as the characteristics and mechanisms of VFs continually evolve with the emergence of antibiotic resistance, it is increasingly difficult to identify novel VFs using existing tools that were developed on outdated datasets. Second, few systematic feature engineering efforts have been made to examine the utility of different types of features for model performance, as the majority of tools extract only a few types of features. Addressing these issues is likely to improve the accuracy of VF predictors significantly, which would be particularly useful for genome-wide prediction of VFs. In this work, we present a deep learning (DL)-based hybrid framework, termed DeepVF, that uses the stacking strategy to achieve more accurate identification of VFs. Using an enlarged, up-to-date dataset, DeepVF comprehensively explores a wide range of heterogeneous features with popular machine learning algorithms. Specifically, four classical algorithms (random forest, support vector machines, extreme gradient boosting and multilayer perceptron) and three DL algorithms (convolutional neural networks, long short-term memory networks and deep neural networks) are employed to train 62 baseline models using these features. To integrate their individual strengths, DeepVF combines these baseline models into a final meta model using the stacking strategy.
Extensive benchmarking experiments demonstrate the effectiveness of DeepVF: it achieves more accurate and stable performance than the baseline models on the benchmark dataset and clearly outperforms state-of-the-art VF predictors on the independent test. Based on the proposed hybrid ensemble model, a user-friendly online predictor of DeepVF (http://deepvf.erc.monash.edu/) is implemented, and its utility from the user's viewpoint is compared with that of existing toolkits. We expect DeepVF to serve as a useful tool for screening and identifying potential VFs from protein-coding gene sequences in bacterial genomes.
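
The stacking strategy described above can be illustrated with a minimal sketch. This is not the DeepVF implementation: it assumes scikit-learn, uses a synthetic feature matrix in place of real sequence-derived features, substitutes scikit-learn's GradientBoostingClassifier for extreme gradient boosting, omits the three DL models, and picks logistic regression as a hypothetical meta model. The function name `build_stacked_model` and all hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_stacked_model(random_state=0):
    """Hypothetical helper: stack four classical baseline models."""
    base_models = [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        # SVM and MLP are scale-sensitive, so each gets its own scaler.
        ("svm", make_pipeline(
            StandardScaler(),
            SVC(probability=True, random_state=random_state),
        )),
        # Stand-in for extreme gradient boosting (assumption, not XGBoost itself).
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=random_state)),
        ("mlp", make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=random_state),
        )),
    ]
    # The meta model is trained on out-of-fold class probabilities
    # produced by the base models under 5-fold cross-validation.
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),
        cv=5,
        stack_method="predict_proba",
    )


# Synthetic binary data standing in for VF / non-VF feature vectors.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = build_stacked_model()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Training the meta model on out-of-fold predictions, rather than on predictions for data the base models have already seen, is what keeps the second layer from simply rewarding the most overfitted baseline.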

[39]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[40]  Jiangning Song,et al.  ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides , 2018, Bioinform..

[41]  Yanjun Qi,et al.  DeepChrome: deep-learning for predicting gene expression from histone modifications , 2016, Bioinform..

[42]  B. Das,et al.  CTX phage of Vibrio cholerae: Genomics and applications. , 2020, Vaccine.

[43]  Guodong Chen,et al.  PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization , 2018, Bioinform..

[44]  Jian Huang,et al.  Prediction and identification of the effectors of heterotrimeric G proteins in rice (Oryza sativa L.) , 2016, Briefings Bioinform..

[45]  Dong-Qing Wei,et al.  PredT4SE-Stack: Prediction of Bacterial Type IV Secreted Effectors From Protein Sequences Using a Stacked Ensemble Method , 2018, Front. Microbiol..

[46]  Iddo Friedberg,et al.  Identifying antimicrobial peptides using word embedding with deep recurrent neural networks , 2018, bioRxiv.

[47]  Jiangning Song,et al.  Bastion6: a bioinformatics approach for accurate prediction of type VI secreted effectors , 2018, Bioinform..

[48]  K. Chou,et al.  Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. , 2000, Biochemical and biophysical research communications.

[49]  Xue-wen Chen,et al.  On Position-Specific Scoring Matrix for Protein Function Prediction , 2011, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[50]  Kuo-Chen Chou,et al.  A Comparison of Computational Methods for Identifying Virulence Factors , 2012, PloS one.

[51]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[52]  Ning Ma,et al.  BLAST+: architecture and applications , 2009, BMC Bioinformatics.

[53]  Xing-Ming Zhao,et al.  Victors: a web-based knowledge base of virulence factors in human and animal pathogens , 2018, Nucleic Acids Res..

[54]  Maulik Shukla,et al.  Curation, integration and visualization of bacterial virulence factors in PATRIC , 2014, Bioinform..

[55]  A. Cross What is a virulence factor? , 2008, Critical care.

[56]  O. Stegle,et al.  Deep learning for computational biology , 2016, Molecular systems biology.

[57]  Klaus Obermayer,et al.  Fast model-based protein homology detection without alignment , 2007, Bioinform..

[58]  Wolf-Dietrich Hardt,et al.  Phages and the Evolution of Bacterial Pathogens: from Genomic Rearrangements to Lysogenic Conversion , 2004, Microbiology and Molecular Biology Reviews.

[59]  Reza Ebrahimpour,et al.  PPIevo: protein-protein interaction prediction from PSSM based evolutionary information. , 2013, Genomics.

[60]  Xue-wen Chen,et al.  Sequence-based prediction of protein interaction sites with an integrative method , 2009, Bioinform..

[61]  Geoffrey I. Webb,et al.  Comprehensive assessment and performance improvement of effector protein predictors for bacterial secretion systems III, IV and VI , 2016, Briefings Bioinform..

[62]  Shinn-Ying Ho,et al.  Virulent-GO: Prediction of Virulent Proteins in Bacterial Pathogens Utilizing Gene Ontology Terms , 2009 .

[63]  Dinesh Gupta,et al.  Identifying Bacterial Virulent Proteins by Fusing a Set of Classifiers Based on Variants of Chou's Pseudo Amino Acid Composition and on Evolutionary Information , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[64]  David K. Gifford,et al.  Convolutional neural network architectures for predicting DNA–protein binding , 2016, Bioinform..

[65]  M. Kollef,et al.  Re-estimating annual deaths due to multidrug-resistant organism infections , 2018, Infection Control & Hospital Epidemiology.

[66]  Vineet K. Sharma,et al.  MP3: A Software Tool for the Prediction of Pathogenic Proteins in Genomic and Metagenomic Data , 2014, PloS one.

[67]  K. Chou,et al.  PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. , 2008, Analytical biochemistry.

[68]  Geoffrey I. Webb,et al.  Large-scale comparative assessment of computational predictors for lysine post-translational modification sites , 2018, Briefings Bioinform..

[69]  Bernhard Y. Renard,et al.  Predicting bacterial virulence factors - evaluation of machine learning and negative data strategies , 2020, Briefings Bioinform..

[70]  Runtao Yang,et al.  An Ensemble Method to Distinguish Bacteriophage Virion from Non-Virion Proteins Based on Protein Sequence Characteristics , 2015, International journal of molecular sciences.

[71]  Charu C. Aggarwal,et al.  Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 2016, KDD.

[72]  Rida Assaf,et al.  Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center , 2016, Nucleic Acids Res..

[73]  Lingyun Zou,et al.  Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles , 2013, Bioinform..

[74]  Amarda Shehu,et al.  Deep learning improves antimicrobial peptide recognition , 2018, Bioinform..

[75]  Lei Chen,et al.  Computationally identifying virulence factors based on KEGG pathways. , 2013, Molecular bioSystems.

[76]  Robert Clarke,et al.  Optimized multilayer perceptrons for molecular classification and diagnosis using genomic data , 2006, Bioinform..

[77]  Yoshua Bengio,et al.  Random Search for Hyper-Parameter Optimization , 2012, J. Mach. Learn. Res..

[78]  Jian Zhao,et al.  CarcinoPred-EL: Novel models for predicting the carcinogenicity of chemicals using molecular fingerprints and ensemble learning methods , 2017, Scientific Reports.

[79]  Maziyar Baran Pouyan,et al.  Random forest based similarity learning for single cell RNA sequencing data , 2018, bioRxiv.

[80]  Jian Yang,et al.  VFDB 2019: a comparative pathogenomic platform with an interactive web interface , 2018, Nucleic Acids Res..

[81]  Robert L. Smith,et al.  Estimating confidence intervals for information transfer analysis of confusion matrices. , 2014, The Journal of the Acoustical Society of America.