AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest

Antimicrobial peptides (AMPs) are promising candidates in the fight against multidrug-resistant pathogens owing to their broad range of activities and low toxicity. Nonetheless, identifying AMPs through wet-lab experiments remains expensive and time-consuming. Here, we propose an accurate computational method for AMP prediction based on the random forest algorithm. The prediction model uses the distribution patterns of amino acid properties along the sequence as features. Using our large and diverse collection of AMP and non-AMP data (3268 and 166791 sequences, respectively), we evaluated 19 random forest classifiers trained with different positive:negative data ratios by 10-fold cross-validation. Our optimal model, AmPEP with the 1:3 data ratio, achieved an accuracy of 96%, a Matthews correlation coefficient (MCC) of 0.9, an area under the receiver operating characteristic curve (AUC-ROC) of 0.99, and a Kappa statistic of 0.9. Descriptor analysis of the AMP/non-AMP distributions by means of Pearson correlation coefficients revealed that reduced feature sets (from the full set of 105 features down to a minimal set of 23) yield comparable performance in all respects except for some reduction in precision. Furthermore, AmPEP outperformed existing methods in terms of accuracy, MCC, and AUC-ROC when tested on benchmark datasets.
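To make the idea of a "distribution pattern" concrete, the following is a minimal sketch of a distribution-style descriptor for a single amino acid property. It is an illustration under stated assumptions, not the AmPEP implementation: the three-class hydrophobicity grouping is a commonly used convention that the abstract does not specify, the function name `distribution_descriptor` is hypothetical, and the rounding rule for locating the 25%, 50%, and 75% residues is one of several in use. For each class, the sketch records where along the sequence (as a percentage of its length) the first, 25%, 50%, 75%, and last residue of that class occurs.

```python
from math import ceil

# Assumed three-class hydrophobicity grouping (a common convention);
# the abstract does not list the groupings actually used by AmPEP.
HYDROPHOBICITY_CLASSES = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}

# Fractions of a class's residues at which positions are recorded.
FRACTIONS = (0.0, 0.25, 0.5, 0.75, 1.0)


def distribution_descriptor(sequence, classes=HYDROPHOBICITY_CLASSES):
    """Return, per class, the relative positions (in % of sequence length)
    of the first, 25%, 50%, 75%, and last residue belonging to that class.

    Classes that do not occur in the sequence get 0.0 for all five values.
    """
    sequence = sequence.upper()
    length = len(sequence)
    features = {}
    for name, members in classes.items():
        # 1-based positions of the residues that fall into this class.
        positions = [i + 1 for i, aa in enumerate(sequence) if aa in members]
        for frac in FRACTIONS:
            key = f"{name}_{int(frac * 100):03d}"
            if not positions:
                features[key] = 0.0
                continue
            # Which occurrence marks this fraction (the first one for 0%).
            # Rounding up is an assumption; other tools round down.
            k = max(1, ceil(frac * len(positions)))
            features[key] = 100.0 * positions[k - 1] / length
    return features


if __name__ == "__main__":
    # Toy peptide, chosen only for illustration.
    for key, value in distribution_descriptor("KLLK").items():
        print(key, value)
```

One property with three classes and five positions gives 15 values per sequence; repeating this over several properties produces a fixed-length feature vector that a random forest classifier can consume regardless of peptide length.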
