Accurate prediction for atomic‐level protein design and its application in diversifying the near‐optimal sequence space

The task of engineering a protein to assume a target three‐dimensional structure is known as protein design. Computational search algorithms are devised to predict a minimum‐energy amino acid sequence for a particular structure. In practice, however, an ensemble of low‐energy sequences is often sought. The primary reason is that an individual predicted low‐energy sequence may not fold to the target structure, owing both to inaccuracies in modeling protein energetics and to the nonoptimal nature of the search algorithms employed. Additionally, some low‐energy sequences may be overly stable and thus lack the dynamic flexibility required for biological function. Furthermore, investigating low‐energy sequence ensembles provides crucial insights into the pseudo‐physical energy force fields that have been derived to describe structural energetics for protein design. Significantly, numerous studies have predicted low‐energy sequences that were subsequently synthesized and demonstrated to fold to the desired structures. However, the sequence space that such energy functions define as compatible with a target structure has not been characterized in full detail. This issue is critical if protein design scientists are to continue using these force fields successfully at an ever‐increasing pace and scale. In this paper, we present a conceptually novel algorithm that rapidly predicts the set of lowest‐energy sequences for a given structure. Based on the theory of probabilistic graphical models, it performs efficient inspection and partitioning of the near‐optimal sequence space without making any assumptions of positional independence. We benchmark its performance on a diverse set of relevant protein design examples and show that it consistently yields sequences of lower energy than those derived from state‐of‐the‐art techniques. Thus, we find that previously presented search techniques do not depict the low‐energy space as precisely. Examination of the predicted ensembles indicates that, for each structure, the amino acid identity at a majority of positions must be chosen extremely selectively so as not to incur significant energetic penalties. We investigate this high degree of similarity and demonstrate how more diverse near‐optimal sequences can be predicted in order to systematically overcome this bottleneck for computational design. Finally, we exploit this in‐depth analysis of a collection of the lowest‐energy sequences to suggest an explanation for previously observed experimental design results. The novel methodologies introduced here accurately portray the sequence space compatible with a protein structure and further supply a scheme to yield heterogeneous low‐energy sequences, thus providing a powerful instrument for future work on protein design. Proteins 2009. © 2008 Wiley‐Liss, Inc.
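To make the notions of a "set of lowest energy sequences" and "more diverse near‐optimal sequences" concrete, the following is a minimal toy sketch. It is not the algorithm of the paper, which is based on probabilistic graphical models; it uses exhaustive enumeration, which is feasible only for tiny examples. The reduced alphabet, the number of positions, the random energy tables, and all function names are hypothetical and chosen purely for illustration. The energy is a sum of single‐position and pairwise terms, so no positional independence is assumed.

```python
import heapq
import itertools
import random

ALPHABET = "AVLK"  # hypothetical reduced amino acid alphabet
N_POS = 5          # hypothetical number of design positions


def make_toy_energies(seed=0):
    """Build random single-position and pairwise energy tables (illustrative only)."""
    rng = random.Random(seed)
    single = [{a: rng.uniform(-1, 1) for a in ALPHABET} for _ in range(N_POS)]
    pair = {
        (i, j): {(a, b): rng.uniform(-1, 1) for a in ALPHABET for b in ALPHABET}
        for i in range(N_POS)
        for j in range(i + 1, N_POS)
    }
    return single, pair


def energy(seq, single, pair):
    """Total energy: sum of single-position terms plus all pairwise terms."""
    e = sum(single[i][a] for i, a in enumerate(seq))
    e += sum(table[(seq[i], seq[j])] for (i, j), table in pair.items())
    return e


def lowest_energy_sequences(single, pair, m):
    """Return the m lowest-energy (energy, sequence) pairs by exhaustive search."""
    scored = (
        (energy(s, single, pair), "".join(s))
        for s in itertools.product(ALPHABET, repeat=N_POS)
    )
    return heapq.nsmallest(m, scored)


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def diverse_low_energy(ranked, m, min_dist):
    """Greedily keep low-energy sequences that differ from every
    previously kept sequence at min_dist or more positions."""
    chosen = []
    for e, s in ranked:
        if all(hamming(s, t) >= min_dist for _, t in chosen):
            chosen.append((e, s))
        if len(chosen) == m:
            break
    return chosen


single, pair = make_toy_energies()
ranked = lowest_energy_sequences(single, pair, len(ALPHABET) ** N_POS)  # full ranking
top = ranked[:5]                                   # plain lowest-energy ensemble
diverse = diverse_low_energy(ranked, 5, min_dist=3)  # diversified ensemble

for e, s in top:
    print(f"top      {s}  {e:.3f}")
for e, s in diverse:
    print(f"diverse  {s}  {e:.3f}")
```

In this toy setting the plain ensemble `top` may contain sequences that differ at only one or two positions, whereas `diverse` trades some energy for a guaranteed minimum Hamming distance between its members. The greedy distance filter is one simple way to diversify and is an assumption of this sketch, not a description of the scheme used in the paper.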
