Protein sequence design by explicit energy landscape optimization

The protein design problem is to identify an amino acid sequence that folds to a desired structure. Given Anfinsen’s thermodynamic hypothesis of folding, this can be recast as finding an amino acid sequence for which the lowest energy conformation is that structure. Because this calculation involves not only all possible amino acid sequences but also all possible structures, most current approaches focus instead on the more tractable problem of finding the lowest energy amino acid sequence for the desired structure. A second step often uses protein structure prediction to check that the desired structure is indeed the lowest energy conformation for the designed sequence, and the designed sequences that fail this check, in many cases a large fraction, are discarded. Here we show that by backpropagating gradients through the trRosetta structure prediction network from the desired structure to the input amino acid sequence, we can directly optimize over all possible amino acid sequences and all possible structures, and in one calculation explicitly design amino acid sequences predicted to fold into the desired structure and not any other. We find that trRosetta calculations, which consider the full conformational landscape, can be more effective than Rosetta single-point energy estimates in predicting folding and stability of de novo designed proteins. We compare sequence design by landscape optimization to the standard fixed-backbone sequence design methodology in Rosetta, and show that the results of the former, but not the latter, are sensitive to the presence of competing low-lying states. We show further that more funneled energy landscapes can be designed by combining the strengths of the two approaches: the low-resolution trRosetta model serves to disfavor alternative states, and the high-resolution Rosetta model serves to create a deep energy minimum at the design target structure.
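The idea of backpropagating from a desired structure to the input sequence can be sketched with a toy model. The code below is not trRosetta: the "predictor" is a hypothetical linear scorer `W` over a handful of competing structure states, and the names `design_sequence`, `W`, and `target` are illustrative assumptions. It shows only the general mechanism, namely relaxing the sequence to per-position amino acid probabilities, scoring every state, and descending the gradient of the negative log-probability of the target state, so that alternative states are penalized in the same calculation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def design_sequence(W, target, steps=200, lr=1.0, seed=0):
    """Gradient-based sequence design against a toy predictor.

    W      : (n_states, length, 20) weights of a hypothetical linear
             predictor; a stand-in for a real structure prediction network.
    target : index of the structure state the sequence should prefer.
    Returns the designed sequence (amino acid indices) and the predicted
    probability of each state for that discrete sequence.
    """
    rng = np.random.default_rng(seed)
    n_states, length, n_aa = W.shape
    logits = 0.01 * rng.standard_normal((length, n_aa))
    onehot_target = np.eye(n_states)[target]

    for _ in range(steps):
        P = softmax(logits)                       # relaxed sequence, rows sum to 1
        scores = np.einsum("kia,ia->k", W, P)     # one score per structure state
        q = softmax(scores)                       # predicted state probabilities
        # Loss is -log q[target]; its gradient w.r.t. the scores is q - onehot.
        dscores = q - onehot_target
        dP = np.einsum("k,kia->ia", dscores, W)   # chain rule through the scorer
        # Chain rule through the per-position softmax over amino acids.
        dlogits = P * (dP - (P * dP).sum(axis=1, keepdims=True))
        logits -= lr * dlogits

    seq = logits.argmax(axis=1)                   # discretize the relaxed sequence
    q_final = softmax(np.einsum("kia,ia->k", W, np.eye(n_aa)[seq]))
    return seq, q_final

# Toy problem: 5 competing states, 30 positions, 20 amino acid types.
W = np.random.default_rng(1).standard_normal((5, 30, 20))
seq, q = design_sequence(W, target=2)
print(seq, q.round(3))
```

Because the loss is normalized over all states, lowering the scores of competing states reduces it just as raising the target score does. This is the toy analogue of designing for the target structure and against alternatives at once.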
Significance

Computational protein design has primarily focused on finding sequences that have very low energy in the target designed structure. However, what is most relevant during folding is not the absolute energy of the folded state, but the energy difference between the folded state and the lowest-lying alternative states. We describe a deep learning approach that captures the entire folding landscape, and show that it can enhance current protein design methods.
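The point about energy differences can be illustrated with a small Boltzmann calculation. This is a minimal sketch under stated assumptions: the landscape is reduced to three discrete states, the energies are invented for illustration, and `kT = 0.6` kcal/mol approximates room temperature. The function name `target_population` is hypothetical.

```python
import numpy as np

def target_population(energies, target=0, kT=0.6):
    """Boltzmann population of one state among a discrete set of states.

    energies : state energies in kcal/mol (illustrative values only).
    target   : index of the designed state.
    kT       : thermal energy in kcal/mol (about 0.6 near room temperature).
    """
    e = np.asarray(energies, dtype=float)
    w = np.exp(-(e - e.min()) / kT)   # shift by the minimum for numerical stability
    return w[target] / w.sum()

# Very low absolute energy, but alternatives lie within 1 kcal/mol.
deep_but_contested = target_population([-100.0, -99.5, -99.0])

# Much higher absolute energy, but alternatives lie at least 5 kcal/mol above.
shallow_but_isolated = target_population([-50.0, -45.0, -44.0])

print(round(deep_but_contested, 3), round(shallow_but_isolated, 4))
```

In this toy setting the second sequence populates its target state far more fully than the first, even though its absolute energy is much higher, because only the gaps to the competing states enter the population.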
