A computational framework to empower probabilistic protein design

Motivation: The task of engineering a protein to perform a target biological function is known as protein design. A commonly used paradigm casts this functional design problem as a structural one, assuming a fixed backbone. In probabilistic protein design, positional amino acid probabilities are used to create a random library of sequences that are screened simultaneously for biological activity. Some choices of probability distribution will clearly be more successful than others in yielding functional sequences. However, since the number of sequences is exponential in protein length, computational optimization of the distribution is difficult.

Results: In this paper, we develop a computational framework for probabilistic protein design following the structural paradigm. We model the distribution of sequences for a structure as a Boltzmann distribution over their free energies. We construct the corresponding probabilistic graphical model and apply belief propagation (BP) to calculate marginal amino acid probabilities. We test this method on a large structural dataset and demonstrate the superiority of BP over previous methods. Nevertheless, since the results obtained by BP are far from optimal, we thoroughly assess the paradigm using high-quality experimental data. We demonstrate that, for small-scale sub-problems, BP attains results identical to those produced by exact inference on the paradigmatic model. However, quantitative analysis shows that the predicted distributions differ significantly from the experimental data. These findings, along with the excellent performance we observed using BP on the smaller problems, suggest potential shortcomings of the paradigm. We conclude with a discussion of how it may be improved in the future.

Contact: fromer@cs.huji.ac.il
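As a rough illustration of the Results above, the following minimal sketch computes positional marginals under a Boltzmann distribution, P(s) proportional to exp(-E(s)/kT), in two ways: by brute-force enumeration of all sequences and by sum-product BP. Everything in it is an assumption made for illustration: the three-letter alphabet, the four-position chain, the energy values, and the function names are hypothetical and are not the energy function, graph, or implementation used in the paper. The toy graph is a chain, on which BP is exact. Real protein interaction graphs contain loops, so BP there is only approximate.

```python
import itertools
import math


def total_energy(seq, single, pair):
    """Energy of a sequence: per-position terms plus neighbor-pair terms."""
    e = sum(single[i][a] for i, a in enumerate(seq))
    e += sum(pair[i][seq[i]][seq[i + 1]] for i in range(len(seq) - 1))
    return e


def exact_marginals(single, pair, kT=1.0):
    """Positional marginals by enumerating all k**n sequences (exponential cost)."""
    n, k = len(single), len(single[0])
    marg = [[0.0] * k for _ in range(n)]
    z = 0.0  # partition function
    for seq in itertools.product(range(k), repeat=n):
        w = math.exp(-total_energy(seq, single, pair) / kT)
        z += w
        for i, a in enumerate(seq):
            marg[i][a] += w
    return [[m / z for m in row] for row in marg]


def bp_marginals(single, pair, kT=1.0):
    """Positional marginals by sum-product message passing along the chain."""
    n, k = len(single), len(single[0])
    # Boltzmann factors for single-position and pairwise energies.
    phi = [[math.exp(-single[i][a] / kT) for a in range(k)] for i in range(n)]
    psi = [[[math.exp(-pair[i][a][b] / kT) for b in range(k)]
            for a in range(k)] for i in range(n - 1)]
    fwd = [[1.0] * k for _ in range(n)]  # message into position i from the left
    bwd = [[1.0] * k for _ in range(n)]  # message into position i from the right
    for i in range(1, n):
        msg = [sum(fwd[i - 1][a] * phi[i - 1][a] * psi[i - 1][a][b]
                   for a in range(k)) for b in range(k)]
        s = sum(msg)
        fwd[i] = [m / s for m in msg]  # normalize for numerical stability
    for i in range(n - 2, -1, -1):
        msg = [sum(bwd[i + 1][b] * phi[i + 1][b] * psi[i][a][b]
                   for b in range(k)) for a in range(k)]
        s = sum(msg)
        bwd[i] = [m / s for m in msg]
    beliefs = []
    for i in range(n):
        b = [fwd[i][a] * phi[i][a] * bwd[i][a] for a in range(k)]
        s = sum(b)
        beliefs.append([x / s for x in b])
    return beliefs


# Hypothetical toy energies: 4 positions, 3 "amino acids", arbitrary units.
single = [[0.0, 0.5, 1.0], [0.3, 0.0, 0.8], [1.2, 0.4, 0.0], [0.0, 0.9, 0.2]]
coupling = [[0.0, 0.6, -0.4], [0.6, 0.0, 0.3], [-0.4, 0.3, 0.0]]
pair = [coupling] * 3  # the same coupling matrix for each neighboring pair

exact = exact_marginals(single, pair)
bp = bp_marginals(single, pair)
max_diff = max(abs(x - y) for re, rb in zip(exact, bp) for x, y in zip(re, rb))
```

Here `max_diff` measures the largest disagreement between the two sets of marginals. Enumeration visits k**n sequences, while the message-passing routine costs on the order of n * k**2 operations on a chain, which is the practical reason for preferring BP as the number of positions grows.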
