Predicting bacterial virulence factors - evaluation of machine learning and negative data strategies

Bacterial proteins dubbed virulence factors (VFs) are a highly diverse group of sequences whose only obvious commonality is the very property of being, more or less directly, involved in virulence. It is therefore tempting to ask whether their prediction, which is commonly based on direct sequence similarity (seqsim) to known VFs, could be enhanced or even replaced by machine-learning methods. Specifically, when trained on a large and diverse set of VFs, such methods may be able to detect non-trivial characteristics shared by otherwise unrelated VF families, and therefore better predict novel VFs that show no significant similarity to any individual family. We first reassess the performance of dimer-based Support Vector Machines, as employed in the widely used MP3 method, by comparing them with seqsim-only and seqsim/dimer-hybrid classifiers. We then repeat the analysis with a novel, considerably more diverse data set, also addressing the important problem of negative data selection. Finally, we move on to the real-world use case of proteome-wide VF prediction, outlining different approaches to estimating specificity in this scenario. We find that direct seqsim is of unparalleled importance and should therefore always be exploited. We also observe strikingly low correlations between different feature and classifier types when ranking proteins by VF likeness. We therefore propose a 'best of each world' approach to prioritizing proteins for experimental testing, which focusses on the top predictions of each classifier. In addition, we recommend developing classifiers for individual VF families.
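
As a rough illustration of two ideas mentioned above, the following minimal Python sketch computes dimer (dipeptide) frequency features for a protein sequence and combines the top-ranked proteins of several classifiers in a 'best of each world' fashion. The function names, the restriction to the 20 standard amino acids, the cutoff `k`, and the example scores are assumptions made for illustration; they are not taken from MP3 or from the method evaluated here.

```python
from itertools import product

# Assumption: only the 20 standard amino acids are considered.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIMERS = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 dimers


def dimer_frequencies(seq):
    """Return the relative frequency of each of the 400 dimers in seq."""
    counts = dict.fromkeys(DIMERS, 0)
    total = 0
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer in counts:  # skip dimers containing non-standard residues
            counts[dimer] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIMERS]


def best_of_each_world(scores_by_classifier, k):
    """Collect the k top-scoring proteins of every classifier (hypothetical helper).

    scores_by_classifier maps a classifier name to a dict of
    protein id -> VF-likeness score (higher means more VF-like).
    Proteins are returned in first-seen order, without duplicates.
    """
    selected = []
    for scores in scores_by_classifier.values():
        top = sorted(scores, key=scores.get, reverse=True)[:k]
        for protein_id in top:
            if protein_id not in selected:
                selected.append(protein_id)
    return selected


# Illustrative, made-up scores for three proteins and two classifier types.
example_scores = {
    "seqsim": {"p1": 0.9, "p2": 0.1, "p3": 0.5},
    "dimer_svm": {"p1": 0.2, "p2": 0.8, "p3": 0.3},
}
candidates = best_of_each_world(example_scores, k=1)
```

Taking the union of per-classifier top lists, rather than averaging scores, reflects the observation that the rankings correlate poorly: each classifier may highlight candidates that the others miss.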
