Minimizing and Learning Energy Functions for Side-Chain Prediction

Side-chain prediction is an important subproblem of the general protein folding problem. Despite much progress in side-chain prediction, performance is far from satisfactory. As an example, the ROSETTA program that uses simulated annealing to select the minimum energy conformations, correctly predicts the first two side-chain angles for approximately 72% of the buried residues in a standard data set. Is further improvement more likely to come from better search methods, or from better energy functions? Given that exact minimization of the energy is NP hard, it is difficult to get a systematic answer to this question. In this paper, we present a novel search method and a novel method for learning energy functions from training data that are both based on Tree Reweighted Belief Propagation (TRBP). We find that TRBP can find the global optimum of the ROSETTA energy function in a few minutes of computation for approximately 85% of the proteins in a standard benchmark set. TRBP can also effectively bound the partition function which enables using the Conditional Random Fields (CRF) framework for learning. Interestingly, finding the global minimum does not significantly improve sidechain prediction for an energy function based on ROSETTA's default energy terms (less than 0.1%), while learning new weights gives a significant boost from 72% to 78%. Using a recently modified ROSETTA energy function with a softer Lennard-Jones repulsive term, the global optimum does improve prediction accuracy from 77% to 78%. Here again, learning new weights improves side-chain modeling even further to 80%. Finally, the highest accuracy (82.6%) is obtained using an extended rotamer library and CRF learned weights. Our results suggest that combining machine learning with approximate inference can improve the state-of-the-art in side-chain prediction.

[1]  R. Goldstein Efficient rotamer elimination applied to protein side-chains and related spin glasses. , 1994, Biophysical journal.

[2]  Aviezri S. Fraenkel Protein folding, spin glass and computational complexity , 1997, DNA Based Computers.

[3]  Mark W. Schmidt,et al.  Accelerated training of conditional random fields with stochastic gradient methods , 2006, ICML.

[4]  Yair Weiss,et al.  Approximate Inference and Protein-Folding , 2002, NIPS.

[5]  Jack Snoeyink,et al.  An Adaptive Dynamic Programming Algorithm for the Side Chain Placement Problem , 2004, Pacific Symposium on Biocomputing.

[6]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[7]  Yann LeCun,et al.  Loss Functions for Discriminative Training of Energy-Based Models , 2005, AISTATS.

[8]  M. Karplus,et al.  Effective energy function for proteins in solution , 1999, Proteins.

[9]  D. Baker,et al.  Native protein sequences are close to optimal for their structures. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[10]  Johan Desmet,et al.  The dead-end elimination theorem and its use in protein side-chain positioning , 1992, Nature.

[11]  Trevor Darrell,et al.  Conditional Random Fields for Object Recognition , 2004, NIPS.

[12]  D. Baker,et al.  An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. , 2003, Journal of molecular biology.

[13]  Martin J. Wainwright,et al.  On the Optimality of Tree-reweighted Max-product Message-passing , 2005, UAI.

[14]  Barry Honig,et al.  Extending the accuracy limits of prediction for side-chain conformations. , 2001 .

[15]  Z. Xiang,et al.  Extending the accuracy limits of prediction for side-chain conformations. , 2001, Journal of molecular biology.

[16]  Yair Weiss,et al.  Globally optimal solutions for energy minimization in stereo vision using reweighted belief propagation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[17]  D. Baker,et al.  Modeling structurally variable regions in homologous proteins with rosetta , 2004, Proteins.

[18]  Vladimir Kolmogorov,et al.  Convergent Tree-Reweighted Message Passing for Energy Minimization , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[19]  A R Leach,et al.  Exploring the conformational space of protein side chains using dead‐end elimination and the A* algorithm , 1998, Proteins.

[20]  Adrian A Canutescu,et al.  A graph‐theory algorithm for rapid protein side‐chain prediction , 2003, Protein science : a publication of the Protein Society.

[21]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[22]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[23]  S. L. Mayo,et al.  Conformational splitting: A more powerful criterion for dead‐end elimination , 2000 .

[24]  A Joshua Wand,et al.  Improved side‐chain prediction accuracy using an ab initio potential energy function and a very large rotamer library , 2004, Protein science : a publication of the Protein Society.

[25]  S. Subbiah,et al.  Prediction of protein side-chain conformation by packing optimization. , 1991, Journal of molecular biology.

[26]  Mona Singh,et al.  Solving and analyzing side-chain positioning problems using linear and integer programming , 2005, Bioinform..

[27]  Yi Liu,et al.  RosettaDesign server for protein design , 2006, Nucleic Acids Res..

[28]  Stephen L. Mayo,et al.  Conformational splitting: A more powerful criterion for dead-end elimination , 2000, J. Comput. Chem..

[29]  William T. Freeman,et al.  Understanding belief propagation and its generalizations , 2003 .

[30]  Martin J. Wainwright,et al.  MAP estimation via agreement on (hyper)trees: Message-passing and linear programming , 2005, ArXiv.

[31]  Yair Weiss,et al.  Linear Programming Relaxations and Belief Propagation - An Empirical Study , 2006, J. Mach. Learn. Res..

[32]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[33]  D. Baker,et al.  High-resolution Structural and Thermodynamic Analysis of Extreme Stabilization of Human Procarboxypeptidase by Computational Protein Design , 2007, Journal of molecular biology.

[34]  Hao Chen,et al.  Beyond the rotamer library: Genetic algorithm combined with the disturbing mutation process for upbuilding protein side‐chains , 2003, Proteins.

[35]  Jeffrey J. Gray,et al.  Protein-protein docking with simultaneous optimization of rigid-body displacement and side-chain conformations. , 2003, Journal of molecular biology.

[36]  Roland L. Dunbrack,et al.  Backbone-dependent rotamer library for proteins. Application to side-chain prediction. , 1993, Journal of molecular biology.

[37]  Alex Acero,et al.  Hidden conditional random fields for phone classification , 2005, INTERSPEECH.

[38]  P. Koehl,et al.  A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modelling , 1995, Nature Structural Biology.

[39]  Martin J. Wainwright,et al.  MAP estimation via agreement on trees: message-passing and linear programming , 2005, IEEE Transactions on Information Theory.

[40]  A. Hasman,et al.  Probabilistic reasoning in intelligent systems: Networks of plausible inference , 1991 .

[41]  O. Schueler‐Furman,et al.  Improved side‐chain modeling for protein–protein docking , 2005, Protein science : a publication of the Protein Society.