FunEffector-Pred: Identification of Fungi Effector by Activate Learning and Genetic Algorithm Sampling of Imbalanced Data

Fungal pathogens have evolved the ability to cause serious plant diseases and threaten global food security. Fungal effectors are proteins that exploit host cellular functions to facilitate infection. Identifying effectors is crucial both for controlling disease in crops and for understanding plant-pathogen interactions. However, fungal effector identification remains challenging because most fungal effectors lack consensus motifs and the available data are imbalanced. In this study, a fungal effector predictor, FunEffector-Pred, was designed to learn effectively from an imbalanced dataset. A granular support vector-based under-sampling (GSV-US) strategy combined with a genetic algorithm was used to sample the majority class. When evaluated on an independent test dataset, FunEffector-Pred significantly outperformed existing predictors of fungal effectors. Several informative feature patterns, such as those of Ile, Gly, Val, Leu and Thr, as well as combinations of aromatic amino acids with positively charged amino acids, are reported for fungal effector identification for the first time.
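
As a rough illustration of what genetic-algorithm sampling of a majority class can look like, the following minimal sketch evolves subsets of majority-class samples so that the reduced, balanced training set scores well on a held-out validation set. It is not the GSV-US procedure of FunEffector-Pred, whose details are not given here. The nearest-centroid fitness function, the truncation selection, the union-based crossover, and all names and default parameters (`ga_undersample`, `pop_size`, `mutation_rate`, and so on) are assumptions made for this example only.

```python
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def balanced_accuracy(maj_train, min_train, maj_val, min_val):
    """Score a nearest-centroid classifier by its mean per-class recall.

    This classifier is only a cheap stand-in for whatever model a real
    pipeline would train (for example an SVM).
    """
    c_maj, c_min = centroid(maj_train), centroid(min_train)
    maj_ok = sum(sq_dist(x, c_maj) < sq_dist(x, c_min) for x in maj_val)
    min_ok = sum(sq_dist(x, c_min) <= sq_dist(x, c_maj) for x in min_val)
    return 0.5 * (maj_ok / len(maj_val) + min_ok / len(min_val))

def ga_undersample(majority, minority, maj_val, min_val,
                   pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Return sorted indices of a majority subset as large as the minority class."""
    rng = random.Random(seed)
    k = len(minority)                    # target size: a balanced training set
    all_idx = range(len(majority))

    def fitness(subset):
        chosen = [majority[i] for i in subset]
        return balanced_accuracy(chosen, minority, maj_val, min_val)

    # Each individual is a list of k distinct majority-sample indices.
    population = [rng.sample(all_idx, k) for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better half survives unchanged (elitism).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover: draw k distinct indices from the union of both parents.
            pool = sorted(set(a) | set(b))
            child = rng.sample(pool, k)
            # Mutation: swap one selected index for an unselected one.
            if rng.random() < mutation_rate:
                outside = [i for i in all_idx if i not in child]
                if outside:
                    child[rng.randrange(k)] = rng.choice(outside)
            children.append(child)
        population = parents + children
    return sorted(max(population, key=fitness))
```

Calling `ga_undersample(majority, minority, maj_val, min_val)` with lists of numeric feature vectors returns the indices of the retained majority samples. Fixing the subset size at the minority-class size keeps the classes balanced by construction, so the fitness function only has to reward classification quality.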
