Computational Protein Design Quantifies Structural Constraints on Amino Acid Covariation

Amino acid covariation, in which the identities of amino acids at different sequence positions are correlated, is a hallmark of naturally occurring proteins. This covariation can arise from multiple factors, including selective pressure to maintain protein structure, requirements imposed by a specific function, and phylogenetic sampling bias. Here we use flexible backbone computational protein design to quantify the extent to which protein structure has constrained amino acid covariation in 40 diverse protein domains. We find significant similarities between the amino acid covariation in alignments of natural protein sequences and that in sequences optimized for their structures by computational protein design methods. These results indicate that the structural constraints imposed by protein architecture play a dominant role in shaping amino acid covariation and that computational protein design methods can capture these effects. We also find that the similarity between natural and designed covariation is sensitive to the magnitude and mechanism of backbone flexibility used in computational protein design. Our results thus highlight the necessity of including backbone flexibility to correctly model the precise details of correlated amino acid changes, and they give insight into the pressures underlying these correlations.
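
To make the notion of covariation between two sequence positions concrete, the following is a minimal sketch that scores a pair of alignment columns by their mutual information, one commonly used covariation measure. The abstract does not state which metric the study uses, so the choice of mutual information, the function name `column_mutual_information`, and the toy alignment are all illustrative assumptions rather than the authors' method.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length sequences. This is an illustrative
    measure only; it applies no correction for phylogeny or column entropy.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    count_i = Counter(a for a, _ in pairs)   # residue counts at position i
    count_j = Counter(b for _, b in pairs)   # residue counts at position j
    count_ij = Counter(pairs)                # joint residue-pair counts

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log2(p_ab * n * n / (count_i[a] * count_j[b]))
    return mi


# Hypothetical toy alignment: positions 0 and 1 always change together,
# while position 2 is fully conserved.
toy_alignment = ["AKL", "AKL", "GEL", "GEL"]

mi_coupled = column_mutual_information(toy_alignment, 0, 1)
mi_conserved = column_mutual_information(toy_alignment, 0, 2)
```

In this toy case the perfectly coupled pair scores 1 bit and the pair involving the conserved position scores 0. Under these assumptions, comparing natural and designed covariation would amount to computing such pairwise scores on both alignments and measuring how well they agree.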
