PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only

Many recent efforts have been devoted to developing machine learning-based methods for fast and accurate phosphorylation site prediction. Currently, most well-performing methods build their prediction models on hybrid information, such as evolutionary information and disorder information. Unfortunately, methods of this type suffer from two major limitations: first, they are of little help for protein phosphorylation site prediction when no obvious homology is detected; second, computing such complicated information is time-consuming, which probably limits the use of these predictors in practical applications. In this paper, we present a simple, fast, and powerful feature representation algorithm that thoroughly explores the sequential information from multiple perspectives based only on primary sequences and successfully captures the differences between true phosphorylation sites and non-phosphorylation sites. Using the proposed features, we build a random forest-based predictor, named PhosPred-RF, for the prediction of phosphorylation sites in proteins. We evaluate the proposed predictor and compare it with state-of-the-art predictors on several benchmark data sets. The experimental results show that PhosPred-RF outperforms the other existing predictors, demonstrating its potential as a useful tool for protein phosphorylation site prediction. PhosPred-RF is freely accessible to the public through the user-friendly webserver http://server.malab.cn/PhosPred-RF.
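The abstract does not spell out the feature representation, so the following is only a minimal sketch of what "sequence-only" features could look like, not the authors' algorithm. It assumes a fixed-length window centered on each candidate S/T/Y residue and uses plain amino-acid composition as the feature vector; the function names, the `half_width` parameter, and the `X` padding symbol are all hypothetical choices made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # hypothetical padding symbol for windows that run past the sequence ends


def candidate_windows(sequence, half_width=3, targets="STY"):
    """Yield (position, window) for every S/T/Y residue, padding at the ends."""
    padded = PAD * half_width + sequence + PAD * half_width
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Position i in the original sequence is i + half_width in the padded one,
            # so this slice is centered on the candidate residue.
            yield i, padded[i:i + 2 * half_width + 1]


def composition_features(window):
    """Return the 20 amino-acid frequencies of a window, ignoring padding."""
    counts = Counter(ch for ch in window if ch in AMINO_ACIDS)
    total = sum(counts.values()) or 1  # avoid division by zero on all-padding windows
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence, assumed purely for demonstration.
windows = list(candidate_windows("MKSPTAY", half_width=2))
features = [composition_features(w) for _, w in windows]
```

Vectors of this kind, labeled as phosphorylation or non-phosphorylation sites, could then be passed to any random forest implementation (for example, scikit-learn's `RandomForestClassifier`). Because nothing here depends on homology searches or disorder predictions, feature extraction is a single pass over the primary sequence.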

Jijun Tang | Quan Zou | Leyi Wei | Pengwei Xing
