Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors

MOTIVATION: Identification of virulence factors (VFs) is critical to elucidating bacterial pathogenesis and preventing related infectious diseases. Current computational methods for VF prediction either perform binary classification or cover only a few VF classes that have sufficient samples. In real-world scenarios, however, thousands of VF classes exist, and many of them have only a very limited number of samples available.

RESULTS: We first construct a large VF dataset covering 3,446 VF classes with 160,495 sequences, and then propose deep convolutional neural network (CNN) models for VF classification. We show that (i) for common VF classes with sufficient samples, our models achieve state-of-the-art performance, with an overall accuracy of 0.9831 and an F1-score of 0.9803; and (ii) for uncommon VF classes with limited samples, our models learn transferable features from auxiliary data and, when combined with different predefined features, achieve accuracy ranging from 0.9277 to 0.9512 and F1-score ranging from 0.9168 to 0.9446, outperforming traditional classifiers by 1%-13% in accuracy and by 1%-16% in F1-score.

AVAILABILITY: All of our datasets are publicly available at http://www.mgc.ac.cn/VFNet/, and the source code of our models is publicly available at https://github.com/zhengdd0422/VFNet.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
