OH-PRED: prediction of protein hydroxylation sites by incorporating adapted normal distribution bi-profile Bayes feature extraction and physicochemical properties of amino acids

Hydroxylation of proline or lysine residues in proteins is a common post-translational modification, and it is involved in many physiological and pathological processes. Nonetheless, the exact molecular mechanism of hydroxylation remains under investigation. Because experimental identification of hydroxylation sites is time-consuming and expensive, accurate bioinformatics tools are a desirable alternative for rapid, large-scale identification of protein hydroxylation sites. We therefore developed OH-PRED, a support vector machine-based tool that predicts protein hydroxylation sites by combining adapted normal distribution bi-profile Bayes feature extraction with the physicochemical property indexes of amino acids. In jackknife cross-validation, OH-PRED yields an accuracy of 91.88% and a Matthews correlation coefficient (MCC) of 0.838 for the prediction of hydroxyproline sites, and an accuracy of 97.42% and an MCC of 0.949 for the prediction of hydroxylysine sites. Compared with the latest predictor, PredHydroxy, OH-PRED improves the prediction accuracy for hydroxyproline and hydroxylysine sites by 7.37% and 14.09%, respectively. In independent tests, OH-PRED also outperforms previously published methods.
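The two metrics reported above can be computed from a binary confusion matrix. The following is a minimal sketch using the standard definitions of accuracy and MCC; the function names and the example counts are illustrative assumptions and are not taken from the paper or from the OH-PRED implementation.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all sites (positive and negative) classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    assumed here rather than taken from the paper).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, for illustration only (not the paper's data).
    tp, tn, fp, fn = 90, 95, 5, 10
    print(f"accuracy = {accuracy(tp, tn, fp, fn):.4f}")
    print(f"MCC      = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike accuracy, MCC uses all four confusion-matrix cells, so it stays informative when modified and unmodified sites are unevenly represented.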
