Predicting antifreeze proteins with weighted generalized dipeptide composition and multi-regression feature selection ensemble

Background: Antifreeze proteins (AFPs) are a group of proteins that inhibit the growth of ice crystals in body fluids and thus improve an organism's resistance to freezing. They are vital to the survival of living organisms in extremely cold environments. However, little research has addressed sequence feature extraction and selection for antifreeze protein classification, even though this is of great significance for structure and function prediction.

Results: To predict antifreeze proteins, this paper proposes a feature representation, weighted generalized dipeptide composition (W-GDipC), and a two-stage, multi-regression ensemble feature selection method (LRMR-Ri). In the first stage of LRMR-Ri, four feature selection algorithms (Lasso regression, Ridge regression, maximal information coefficient (MIC) and Relief) each select a feature set. In the second stage, if the four sets share a common feature subset, that subset is taken as the optimal subset; otherwise, Ridge regression is used to select the optimal subset from the set pooled from the four sets. LRMR-Ri combined with W-GDipC was evaluated both on an antifreeze protein dataset (binary classification) and on a membrane protein dataset (multi-class classification). Experimental results show that the method performs well with support vector machine (SVM), decision tree (DT) and stochastic gradient descent (SGD) classifiers. With the antifreeze protein dataset and the SVM classifier, LRMR-Ri with W-GDipC reached ACC, RE and MCC values of 95.56%, 97.06% and 0.9105, respectively. These values are much higher than those of each single method (Lasso, Ridge, MIC and Relief); ACC is nearly 13% higher than with Lasso alone.

Conclusion: The experimental results show that the proposed LRMR-Ri and W-GDipC method can significantly improve the accuracy of antifreeze protein prediction compared with similar single feature selection methods. In addition, the method also achieved good results in the classification and prediction of membrane proteins, which verifies its broad reliability to a certain extent.
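The two-stage logic of LRMR-Ri described in the Results can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the four first-stage selectors have already produced sets of feature indices, and that the Ridge stage ranks pooled features by absolute coefficient and keeps the top `k`; the abstract does not specify that rule. The names `ridge_coefficients`, `lrmr_ri_combine`, `k` and `alpha` are hypothetical.

```python
from typing import List, Sequence, Set


def ridge_coefficients(X: Sequence[Sequence[float]], y: Sequence[float],
                       alpha: float = 1.0) -> List[float]:
    """Solve (X^T X + alpha*I) w = X^T y by Gauss-Jordan elimination.

    No intercept is fitted; alpha > 0 keeps the system non-singular.
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + alpha*I | X^T y].
    A = [
        [sum(X[r][i] * X[r][j] for r in range(n)) + (alpha if i == j else 0.0)
         for j in range(p)]
        + [sum(X[r][i] * y[r] for r in range(n))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def lrmr_ri_combine(selected_sets: Sequence[Set[int]],
                    X: Sequence[Sequence[float]], y: Sequence[float],
                    k: int, alpha: float = 1.0) -> List[int]:
    """Second stage: use the common subset if one exists, else Ridge on the pool."""
    common = set.intersection(*selected_sets)
    if common:
        return sorted(common)
    # No common features: pool the four sets and rank the pool with Ridge.
    pooled = sorted(set.union(*selected_sets))
    X_pooled = [[row[j] for j in pooled] for row in X]
    weights = ridge_coefficients(X_pooled, y, alpha)
    # ASSUMPTION: keep the k pooled features with the largest |coefficient|.
    ranked = sorted(zip(pooled, weights), key=lambda t: abs(t[1]), reverse=True)
    return sorted(j for j, _ in ranked[:k])
```

In practice the four input sets would come from Lasso, Ridge, MIC and Relief selectors run on the W-GDipC feature matrix. They are passed in as plain index sets here so that the combination step can be read on its own.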
