Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types

MOTIVATION With the rapid increase of infection resistance to antibiotics, it is urgent to find novel infection therapeutics. In recent years, antimicrobial peptides (AMPs) have been utilized as potential alternatives for infection therapeutics. AMPs are key components of the innate immune system and can protect the host from various pathogenic bacteria. Identifying AMPs and their functional types has led to many studies, and various predictors using machine learning have been developed. However, there is room for improvement; in particular, no predictor takes into account the lack of balance among different functional AMPs. RESULTS In this paper, a new synthetic minority over-sampling technique on imbalanced and multi-label datasets, referred to as ML-SMOTE, was designed for processing and identifying AMPs' functional families. A novel multi-label classifier, MLAMP, was also developed using ML-SMOTE and grey pseudo amino acid composition. The classifier obtained 0.4846 subset accuracy and 0.16 hamming loss. AVAILABILITY AND IMPLEMENTATION A user-friendly web-server for MLAMP was established at http://www.jci-bioinfo.cn/MLAMP CONTACTS: linweizhong@jci.edu.cn or xudong@missouri.edu.

[1]  K. Chou,et al.  Virus-mPLoc: A Fusion Classifier for Viral Protein Subcellular Location Prediction by Incorporating Multiple Sites , 2010, Journal of biomolecular structure & dynamics.

[2]  K. Chou,et al.  pRNAm-PC: Predicting N(6)-methyladenosine sites in RNA sequences via physical-chemical properties. , 2016, Analytical biochemistry.

[3]  Taghi M. Khoshgoftaar,et al.  RUSBoost: Improving classification performance when training data is skewed , 2008, 2008 19th International Conference on Pattern Recognition.

[4]  Lior Rokach,et al.  Data Mining And Knowledge Discovery Handbook , 2005 .

[5]  K. Chou,et al.  Prediction of Antimicrobial Peptides Based on Sequence Alignment and Feature Selection Methods , 2011, PloS one.

[6]  K. Chou,et al.  iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. , 2015, Analytical biochemistry.

[7]  Gajendra P. S. Raghava,et al.  Analysis and prediction of antibacterial peptides , 2007, BMC Bioinformatics.

[8]  M. Lukic,et al.  Potential therapeutic applications of multifunctional host-defense peptides from frog skin as anti-cancer, anti-viral, immunomodulatory, and anti-diabetic agents , 2014, Peptides.

[9]  K. Chou,et al.  Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. , 2010, Journal of theoretical biology.

[10]  Grigorios Tsoumakas,et al.  Mining Multi-label Data , 2010, Data Mining and Knowledge Discovery Handbook.

[11]  Francisco Charte,et al.  Addressing imbalance in multilabel classification: Measures and random resampling algorithms , 2015, Neurocomputing.

[12]  Boris Vishnepolsky,et al.  Prediction of Linear Cationic Antimicrobial Peptides Based on Characteristics Responsible for Their Interaction with the Membranes , 2013, J. Chem. Inf. Model..

[13]  K. Chou,et al.  iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. , 2011, Journal of theoretical biology.

[14]  K. Chou,et al.  iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins , 2011, PloS one.

[15]  K. Chou Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology , 2009 .

[16]  J. A. Baranauskas,et al.  Prediction of Antimicrobial Activity of Synthetic Peptides by a Decision Tree Model , 2013, Applied and Environmental Microbiology.

[17]  K. Chou Impacts of bioinformatics to medicinal chemistry. , 2015, Medicinal chemistry (Shariqah (United Arab Emirates)).

[18]  Ren Long,et al.  iEnhancer-2L: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition , 2016, Bioinform..

[19]  K. Chou,et al.  Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms , 2008, Nature Protocols.

[20]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[21]  Vladimir B. Bajic,et al.  Distinct profiling of antimicrobial peptide families , 2014, Bioinform..

[22]  Suzana Popovic,et al.  Peptides with antimicrobial and anti-inflammatory activities that have therapeutic potential for treatment of acne vulgaris , 2012, Peptides.

[23]  K. Chou,et al.  iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC , 2016, Oncotarget.

[24]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[25]  Jiun-Hung Chen,et al.  A multi-label classification based approach for sentiment classification , 2015, Expert Syst. Appl..

[26]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[27]  K. Chou,et al.  iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. , 2011, Molecular bioSystems.

[28]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[29]  Sheng Chen,et al.  A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems , 2011, Neurocomputing.

[30]  Víctor Robles,et al.  Feature selection for multi-label naive Bayes classification , 2009, Inf. Sci..

[31]  Xia Li,et al.  APD2: the updated antimicrobial peptide database and its application in peptide design , 2008, Nucleic Acids Res..

[32]  Ren Long,et al.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework , 2016, Bioinform..

[33]  Chumphol Bunkhumpornpat,et al.  Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique for Handling the Class Imbalanced Problem , 2009, PAKDD.

[34]  K. Chou,et al.  iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. , 2012, Molecular bioSystems.

[35]  M. Lefranc,et al.  DBAASP: database of antimicrobial activity and structure of peptides. , 2014, FEMS microbiology letters.

[36]  A. Bahar,et al.  Antimicrobial Peptides , 2013, Pharmaceuticals.

[37]  Francisco Herrera,et al.  Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling , 2011, Soft Comput..

[38]  Mandana Behbahani,et al.  Predicting antibacterial peptides by the concept of Chou's pseudo-amino acid composition and machine learning methods. , 2012, Protein and peptide letters.

[39]  Kuo-Chen Chou,et al.  iSuc-PseOpt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. , 2016, Analytical biochemistry.

[40]  Sheng Chen,et al.  On combination of SMOTE and particle swarm optimization based radial basis function classifier for imbalanced problems , 2011, The 2011 International Joint Conference on Neural Networks.

[41]  Artem Cherkasov,et al.  BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm068 Databases and ontologies AMPer: a database and an automated discovery tool for antimicrobial peptides , 2022 .

[42]  Yoram Singer,et al.  BoosTexter: A Boosting-based System for Text Categorization , 2000, Machine Learning.

[43]  Grigorios Tsoumakas,et al.  MULAN: A Java Library for Multi-Label Learning , 2011, J. Mach. Learn. Res..

[44]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[45]  Wenqi Liu,et al.  Imbalanced Multi-Modal Multi-Label Learning for Subcellular Localization Prediction of Human Proteins with Both Single and Multiple Sites , 2012, PloS one.

[46]  Shreyas Karnik,et al.  CAMP: a useful resource for research on antimicrobial peptides , 2009, Nucleic Acids Res..

[47]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[48]  K. Chou,et al.  iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex Gram-positive bacterial proteins. , 2012, Protein and peptide letters.

[49]  Hongfei Lin,et al.  Gene Function Prediction Based on the Gene Ontology Hierarchical Structure , 2014, PloS one.

[50]  Sun-Yuan Kung,et al.  mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines , 2012, BMC Bioinformatics.

[51]  J. Deng,et al.  Introduction to Grey system theory , 1989 .

[52]  Nitesh V. Chawla,et al.  SMOTEBoost: Improving Prediction of the Minority Class in Boosting , 2003, PKDD.

[53]  Shreyas Karnik,et al.  ClassAMP: A Prediction Tool for Classification of Antimicrobial Peptides , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[54]  Grigorios Tsoumakas,et al.  Multi-Label Classification: An Overview , 2007, Int. J. Data Warehous. Min..

[55]  Zhi-Hua Zhou,et al.  ML-KNN: A lazy learning approach to multi-label learning , 2007, Pattern Recognit..

[56]  G. Spyrou,et al.  C-PAmP: Large Scale Analysis and Database Construction Containing High Scoring Computationally Predicted Antimicrobial Peptides for All the Available Plant Species , 2013, PloS one.

[57]  A. Šmelcerović,et al.  Synthesis, structure and antimicrobial activity of 6-(propan-2-yl)-3-methyl-morpholine-2,5-dione , 2012 .

[58]  K. Chou,et al.  iACP: a sequence-based tool for identifying anticancer peptides , 2016, Oncotarget.

[59]  Kuo-Chen Chou,et al.  Some remarks on predicting multi-label attributes in molecular biosystems. , 2013, Molecular bioSystems.

[60]  Gajendra P. S. Raghava,et al.  AntiBP2: improved version of antibacterial peptide prediction , 2010, BMC Bioinformatics.

[61]  K. Chou,et al.  Euk-mPLoc: a fusion classifier for large-scale eukaryotic protein subcellular location prediction by incorporating multiple sites. , 2007, Journal of proteome research.

[62]  Geoff Holmes,et al.  Classifier chains for multi-label classification , 2009, Machine Learning.

[63]  K. Chou,et al.  Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. , 2007, Biochemical and biophysical research communications.

[64]  Sven Laur,et al.  seqlm: an MDL based method for identifying differentially methylated regions in high density methylation array data , 2016, Bioinform..

[65]  Kuo-Chen Chou,et al.  iPPBS-Opt: A Sequence-Based Ensemble Classifier for Identifying Protein-Protein Binding Sites by Optimizing Imbalanced Training Datasets , 2016, Molecules.

[66]  Xiaowei Zhao,et al.  LAMP: A Database Linking Antimicrobial Peptides , 2013, PloS one.

[67]  Guangshun Wang,et al.  Human Antimicrobial Peptides and Proteins , 2014, Pharmaceuticals.

[68]  Kuo-Chen Chou,et al.  Gpos-mPLoc: a top-down approach to improve the quality of predicting subcellular localization of Gram-positive bacterial proteins. , 2009, Protein and peptide letters.

[69]  K. Duraiswamy,et al.  MLDSS: An Algorithm to Mine Multi-Label Disease Spreading Sequence Using Spatio-Time Interval Database , 2015 .

[70]  Kuo-Chen Chou,et al.  iROS-gPseKNC: Predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition , 2016, Oncotarget.

[71]  Wei Chen,et al.  iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition , 2014, Nucleic acids research.

[72]  K. Chou,et al.  iAMP-2L: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. , 2013, Analytical biochemistry.

[73]  M. Castanho,et al.  From antimicrobial to anticancer peptides. A review , 2013, Front. Microbiol..

[74]  Xuehua Wang,et al.  A New Over-Sampling Approach: Random-SMOTE for Learning from Imbalanced Data Sets , 2011, KSEM.

[75]  Faiza Hanif Waghu,et al.  CAMP: Collection of sequences and structures of antimicrobial peptides , 2013, Nucleic Acids Res..

[76]  Dong Xu,et al.  iPhos‐PseEvo: Identifying Human Phosphorylated Proteins by Incorporating Evolutionary Information into General PseAAC via Grey System Theory , 2017, Molecular informatics.

[77]  K. Chou,et al.  iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. , 2013, Molecular bioSystems.

[78]  Josef Kittler,et al.  Inverse random under sampling for class imbalance problem and its application to multi-label classification , 2012, Pattern Recognit..

[79]  Xiang Cheng,et al.  iDrug-Target: predicting the interactions between drug compounds and target proteins in cellular networking via benchmark dataset optimization approach , 2015, Journal of biomolecular structure & dynamics.

[80]  Kuo-Chen Chou,et al.  pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. , 2016, Journal of theoretical biology.

[81]  K. Chou,et al.  Prediction of protein structural classes. , 1995, Critical reviews in biochemistry and molecular biology.

[82]  Min Zhu,et al.  Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions , 2012, Comput. Biol. Chem..

[83]  Nitesh V. Chawla,et al.  Data Mining for Imbalanced Datasets: An Overview , 2005, The Data Mining and Knowledge Discovery Handbook.