PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only

Many recent efforts have been made for the development of machine learning-based methods for fast and accurate phosphorylation site prediction. Currently, a majority of well-performing methods are based on hybrid information to build prediction models, such as evolutionary information, disorder information, and so on. Unfortunately, this type of methods suffers two major limitations: one is that it would not be much of help for protein phosphorylation site prediction in case of no obvious homology detected; the other is that computing such the complicated information is time-consuming, which probably limits the usage of predictors in practical applications. In this paper, we present a simple, fast, and powerful feature representation algorithm, which sufficiently explores the sequential information from multiple perspectives only based on primary sequences, and successfully captures the differences between true phosphorylation sites and hboxnon-phosphorylation sites. Using the proposed features, we propose a random forest-based predictor named PhosPred-RF in the prediction of protein phosphorylation sites from proteins. We evaluate and compare the proposed predictor with the state-of-the-art predictors on some benchmark data sets. The experimental results show that PhosPred-RF outperforms other existing predictors, demonstrating its potential to be a useful tool for protein phosphorylation site prediction. Currently, the proposed PhosPred-RF is freely accessible to the public through the user-friendly webserver http://server.malab.cn/PhosPred-RF.

[1]  George C. Runger,et al.  Bias of Importance Measures for Multi-valued Attributes and Solutions , 2011, ICANN.

[2]  Yu Xue,et al.  GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length selection. , 2011, Protein engineering, design & selection : PEDS.

[3]  T. Hunter,et al.  Signaling—2000 and Beyond , 2000, Cell.

[4]  Anthony J. Kusalik,et al.  Computational prediction of eukaryotic phosphorylation sites , 2011, Bioinform..

[5]  Q Zou,et al.  Novel representation of RNA secondary structure used to improve prediction algorithms. , 2011, Genetics and molecular research : GMR.

[6]  K. Chou Using subsite coupling to predict signal peptides. , 2001, Protein engineering.

[7]  Wei Chen,et al.  Prediction of phosphothreonine sites in human proteins by fusing different features , 2016, Scientific Reports.

[8]  Jie Wang,et al.  Discriminative pattern mining and its applications in bioinformatics , 2015, Briefings Bioinform..

[9]  Jun Wu,et al.  Mining Conditional Phosphorylation Motifs , 2014, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Dong Xu,et al.  Musite, a Tool for Global Prediction of General and Kinase-specific Phosphorylation Sites* , 2010, Molecular & Cellular Proteomics.

[11]  Xuan Liu,et al.  Identification of DNA-Binding Proteins by Combining Auto-Cross Covariance Transformation and Ensemble Learning , 2016, IEEE Transactions on NanoBioscience.

[12]  Subhadip Basu,et al.  AMS 3.0: prediction of post-translational modifications , 2010, BMC Bioinformatics.

[13]  Wei Chen,et al.  Identification of apolipoprotein using feature selection technique , 2016, Scientific Reports.

[14]  Ashis Kumer Biswas,et al.  Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information , 2010, BMC Bioinformatics.

[15]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[16]  Gianluca Valentino,et al.  Machine learning techniques for protein function prediction , 2020, Proteins.

[17]  Bo Yao,et al.  PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine , 2014, Amino Acids.

[18]  Mona Singh,et al.  Predicting functionally important residues from sequence conservation , 2007, Bioinform..

[19]  Adam Godzik,et al.  Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences , 2006, Bioinform..

[20]  B. Liu,et al.  DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation , 2015, Scientific Reports.

[21]  L. Iakoucheva,et al.  The importance of intrinsic disorder for protein phosphorylation. , 2004, Nucleic acids research.

[22]  Achuthsankar S. Nair,et al.  Composition, Transition and Distribution (CTD) — A dynamic feature for predictions based on hierarchical structure of cellular sorting , 2011, 2011 Annual IEEE India Conference.

[23]  G. Crooks,et al.  WebLogo: a sequence logo generator. , 2004, Genome research.

[24]  Jijun Tang,et al.  Local-DPP: An improved DNA-binding protein prediction method by exploring local evolutionary information , 2017, Inf. Sci..

[25]  Junjie Chen,et al.  A comprehensive review and comparison of different computational methods for protein remote homology detection , 2018, Briefings Bioinform..

[26]  Qiwen Dong,et al.  Identification of DNA-binding proteins by combining auto-cross covariance transformation and ensemble learning. , 2016, IEEE transactions on nanobioscience.

[27]  Xing Gao,et al.  An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information , 2015, IEEE Transactions on NanoBioscience.

[28]  Wei Chen,et al.  iOri-Human: identify human origin of replication by incorporating dinucleotide physicochemical properties into pseudo nucleotide composition , 2016, Oncotarget.

[29]  Wei Chen,et al.  Predicting cancerlectins by the optimal g-gap dipeptides , 2015, Scientific Reports.

[30]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[31]  Ren Long,et al.  iDHS-EL: identifying DNase I hypersensitive sites by fusing three different modes of pseudo nucleotide composition into an ensemble learning framework , 2016, Bioinform..

[32]  Q. Zou,et al.  Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition , 2016, International journal of molecular sciences.

[33]  P Vallotton,et al.  Detection of tubule boundaries based on circular shortest path and polar‐transformation of arbitrary shapes , 2016, Journal of microscopy.

[34]  Hua Tang,et al.  Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique. , 2016, Molecular bioSystems.

[35]  Jun Wu,et al.  Data construction for phosphorylation site prediction , 2014, Briefings Bioinform..

[36]  Xing Gao,et al.  Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique , 2015, IEEE Transactions on NanoBioscience.

[37]  N. Blom,et al.  Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. , 1999, Journal of molecular biology.

[38]  Allegra Via,et al.  Phospho.ELM: a database of phosphorylation sites—update 2008 , 2007, Nucleic Acids Res..

[39]  Liujuan Cao,et al.  A novel features ranking metric with application to scalable visual and bioinformatics data classification , 2016, Neurocomputing.

[40]  Joachim Selbig,et al.  PhosPhAt: a database of phosphorylation sites in Arabidopsis thaliana and a plant-specific phosphorylation site predictor , 2007, Nucleic Acids Res..

[41]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[42]  Ying Ju,et al.  Finding the Best Classification Threshold in Imbalanced Classification , 2016, Big Data Res..