iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier

Protein phosphorylation is a posttranslational modification (PTM or PTLM) in which a phosphoryl group is added to one or more residues of a protein molecule. The residues most commonly phosphorylated are serine (S), threonine (T), and tyrosine (Y). Protein phosphorylation plays a significant role in a wide range of cellular processes, and its dysregulation is implicated in many diseases. From the standpoint of both basic research and drug development, this raises a challenging problem: given an uncharacterized protein sequence containing many S, T, or Y residues, which of them can be phosphorylated and which cannot? To address this problem, we have developed a predictor called iPhos-PseEn, which fuses four different pseudo component approaches (amino acids' disorder scores, nearest neighbor scores, occurrence frequencies, and position weights) into an ensemble classifier via a voting system. Rigorous cross-validations indicated that the proposed predictor remarkably outperformed its existing counterparts. For the convenience of most experimental scientists, a user-friendly web-server for iPhos-PseEn has been established at http://www.jci-bioinfo.cn/iPhos-PseEn, through which users can easily obtain their desired results without having to work through the complicated mathematical equations involved.
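The general idea of scanning a sequence for candidate residues and fusing several classifiers by voting can be illustrated with a minimal Python sketch. It is not the iPhos-PseEn implementation: the window half-width, the padding character, the tie-breaking rule, and the function names (`candidate_windows`, `majority_vote`, `predict_sites`) are all assumptions made for illustration, and the four base classifiers are passed in as arbitrary callables rather than trained models.

```python
from typing import Callable, List, Sequence, Tuple

# A base classifier maps a peptide window to 1 (site) or 0 (non-site).
# In practice each would be trained on one feature encoding; here the
# type is only a placeholder (hypothetical).
Classifier = Callable[[str], int]


def candidate_windows(sequence: str, residue: str,
                      half_width: int = 6, pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every occurrence of `residue`.

    The sequence is padded with `pad` so that windows near the termini
    have the same length as all others (an assumed convention).
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, aa in enumerate(sequence):
        if aa == residue:
            # Index i in `sequence` sits at index i + half_width in `padded`,
            # so this slice is centred on the candidate residue.
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


def majority_vote(votes: Sequence[int]) -> int:
    """Fuse binary votes; a tie is resolved as 0 (assumed rule)."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def predict_sites(sequence: str, residue: str,
                  classifiers: Sequence[Classifier],
                  half_width: int = 6) -> List[Tuple[int, int]]:
    """Label each candidate residue by voting over the base classifiers."""
    return [
        (pos, majority_vote([clf(win) for clf in classifiers]))
        for pos, win in candidate_windows(sequence, residue, half_width)
    ]
```

With four voters a 2:2 split is possible, so any real voting scheme has to state how ties are handled; the sketch simply treats a tie as a negative prediction.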
