Prediction of the Bonding State of Cysteine Residues in Proteins with Machine-Learning Methods

In this paper we evaluate machine-learning methods for predicting the bonding state of cysteines from protein sequence, which is the first step toward identifying disulfide bonds in proteins. We score three approaches: 1) Hidden Support Vector Machines (HSVMs), which integrate SVM predictions with a Hidden Markov Model; 2) SVM-HMMs, which discriminatively train models that are isomorphic to a kth-order hidden Markov model; and 3) Grammatical-Restrained Hidden Conditional Random Fields (GRHCRFs), which we recently introduced. We evaluate two encoding schemes for evolutionary information, based on the sequence profile and on the position-specific scoring matrix (PSSM) as computed with the PSI-BLAST program, and show that all methods perform better with the PSSM encoding than with the sequence profile. Among the methods, GRHCRFs perform slightly better than the others, achieving a per-protein accuracy of 87% with a Matthews correlation coefficient (C) of 0.73. Finally, we investigate how disulfide bonding state prediction differs between Eukaryotes and Prokaryotes. The per-protein accuracy is higher for Prokaryotic proteins than for Eukaryotic ones (0.88 vs 0.83). However, because bonded cysteines are far scarcer in Prokaryotes than in Eukaryotes, the Matthews correlation coefficient is drastically lower for Prokaryotes (0.48 vs 0.80).
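The gap between per-protein accuracy and the Matthews correlation coefficient on class-imbalanced data can be illustrated with a minimal sketch. It assumes the standard definitions: a protein counts as correct only when every one of its cysteines is predicted correctly, and C is computed over all cysteines pooled together. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math

def confusion_counts(proteins):
    """Pool all cysteines and count TP, TN, FP, FN (1 = bonded, 0 = free)."""
    tp = tn = fp = fn = 0
    for true_states, pred_states in proteins:
        for t, p in zip(true_states, pred_states):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 0:
                tn += 1
            elif t == 0 and p == 1:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def per_protein_accuracy(proteins):
    """Fraction of proteins whose cysteines are all predicted correctly."""
    correct = sum(1 for true_states, pred_states in proteins
                  if true_states == pred_states)
    return correct / len(proteins)

# Hypothetical toy set dominated by free cysteines: eight proteins with only
# free cysteines (all predicted correctly), one bonded pair predicted
# correctly, and one bonded pair missed entirely.
toy = [([0, 0], [0, 0])] * 8 + [([1, 1], [1, 1]), ([1, 1], [0, 0])]

acc = per_protein_accuracy(toy)
mcc = matthews_cc(*confusion_counts(toy))
print(f"per-protein accuracy = {acc:.2f}, C = {mcc:.2f}")
```

In this toy set a single missed protein barely changes the per-protein accuracy, but it removes half of the true positives, so C falls much further. This is the same qualitative pattern as the Prokaryote result described above, although the numbers here are invented for illustration only.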
