PredT4SE-Stack: Prediction of Bacterial Type IV Secreted Effectors From Protein Sequences Using a Stacked Ensemble Method

Gram-negative bacteria use various secretion systems to deliver their secreted effectors. Among them, the type IV secretion system is widespread across bacterial species and secretes type IV secreted effectors (T4SEs), which play vital roles in host-pathogen interactions. However, experimental approaches to identifying T4SEs are time- and resource-consuming. In the present study, we aimed to develop an in silico stacked ensemble method that predicts, from sequence information alone, whether a protein is an effector of the type IV secretion system. Protein sequences were encoded as position-specific scoring matrix (PSSM)-composition features, obtained by summing the rows of the PSSM profile that correspond to the same amino acid residue. Based on these features, we developed a stacked ensemble model, PredT4SE-Stack, in which an ensemble of base classifiers implemented with various machine learning algorithms, such as support vector machine, gradient boosting machine, and extremely randomized trees, generates the outputs that serve as inputs to a meta-classifier. Our results demonstrated that the PredT4SE-Stack framework is a feasible and effective way to accurately identify T4SEs from protein sequence information. The datasets and source code of PredT4SE-Stack are freely available at http://xbioinfo.sjtu.edu.cn/PredT4SE_Stack/index.php.
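
The PSSM-composition encoding described above can be illustrated with a minimal sketch. It is not taken from the PredT4SE-Stack source code: the function name `pssm_composition`, the column order in `AMINO_ACIDS`, the skipping of non-standard residues, and the optional division by sequence length are all assumptions made for illustration. The input is assumed to be an L x 20 PSSM whose i-th row describes the i-th residue of the sequence; the output is a 20 x 20 matrix of row sums, flattened to 400 features.

```python
# Assumed column order of the PSSM (the order commonly used by PSI-BLAST output).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_composition(sequence, pssm, normalize=True):
    """Return a 400-dimensional PSSM-composition feature vector.

    sequence  -- protein sequence of length L (one-letter codes)
    pssm      -- L rows of 20 scores; row i belongs to sequence[i]
    normalize -- assumption: divide the sums by L so that proteins of
                 different lengths are comparable
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")

    index = {aa: k for k, aa in enumerate(AMINO_ACIDS)}
    sums = [[0.0] * 20 for _ in range(20)]

    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must contain 20 scores")
        k = index.get(residue)
        if k is None:
            # Assumption: non-standard residues (e.g. X) are skipped.
            continue
        # Add this position's row to the accumulator of its residue type.
        for j, score in enumerate(row):
            sums[k][j] += score

    scale = len(sequence) if normalize and sequence else 1
    # Flatten the 20 x 20 matrix row by row into 400 features.
    return [value / scale for row in sums for value in row]


# Toy example: two alanines and one arginine with constant PSSM rows.
toy_pssm = [[1] * 20, [2] * 20, [3] * 20]
features = pssm_composition("AAR", toy_pssm)
print(len(features))
```

In a stacked ensemble, a vector of this kind would be the input to each base classifier, and the base classifiers' outputs would in turn form the input to the meta-classifier.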
