G-DipC: An Improved Feature Representation Method for Short Sequences to Predict the Type of Cargo in Cell-Penetrating Peptides

Cell-penetrating peptides (CPPs) are functional short peptides with high carrying capacity. CPP sequences with targeting functions enable the highly efficient delivery of drugs to target cells. This paper focuses on predicting the cargo category of CPPs: a biocomputational model is constructed to efficiently distinguish, among the seven known deliverable cargo categories, the category of cargo carried by CPPs acting as macromolecular carriers. Based on dipeptide composition (DipC), an improved feature representation method, general dipeptide composition (G-DipC), is proposed for short peptide sequences; it can effectively increase the abundance of features represented. Linear discriminant analysis (LDA) is then applied to mine important low-dimensional features of G-DipC, and a predictive model is built with the XGBoost algorithm. Experimental results with five-fold cross-validation show that G-DipC improves accuracy by 25 and 5 percent compared with amino acid composition (AAC) and DipC, respectively. G-DipC is even found to outperform tripeptide composition (TipC). Thus, the proposed model provides a novel resource for the study of cell-penetrating peptides, and the improved dipeptide composition G-DipC can be widely adapted to the feature representation of other biological sequences.
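
To make the feature-representation idea concrete, the following is a minimal sketch and not the authors' implementation. Standard DipC is the 400-dimensional vector of normalized counts of adjacent residue pairs; a peptide of length L contributes only L-1 such pairs, so the vector is very sparse for short sequences. The abstract does not define G-DipC, so the gapped extension below (also counting pairs separated by `gap` residues and concatenating the resulting vectors) is an assumption used purely for illustration. The names `gapped_dipeptide_composition`, `combined_features`, and `max_gap` are hypothetical.

```python
from itertools import product

# The 20 standard amino acids and all 400 ordered residue pairs.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gapped_dipeptide_composition(seq, gap=0):
    """Normalized frequencies of residue pairs separated by `gap` residues.

    gap=0 gives the standard dipeptide composition (DipC).
    gap>0 is an illustrative assumption, not the paper's stated definition.
    """
    counts = dict.fromkeys(PAIRS, 0)
    step = gap + 1
    total = 0
    for i in range(len(seq) - step):
        pair = seq[i] + seq[i + step]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:  # sequence too short for this gap
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]


def combined_features(seq, max_gap=2):
    """Concatenate the compositions for gaps 0..max_gap (hypothetical scheme)."""
    features = []
    for gap in range(max_gap + 1):
        features.extend(gapped_dipeptide_composition(seq, gap))
    return features
```

With `max_gap=2` each peptide is mapped to a 1200-dimensional vector, which is the kind of high-dimensional representation that a reduction step such as LDA would then compress before classification.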
