Thoroughly sampling sequence space: Large‐scale protein design of structural ensembles

Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large‐scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences are not diminished. The designed sequences have greater overall diversity than the corresponding natural sequence alignments, and no direct correlation is seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold, together with the separation of sequence entropy distributions for different folds, suggests that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This result has important implications for experimental protein design and engineering and provides insight into protein evolution.
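
The abstract uses sequence entropy as its measure of diversity but does not spell out the formula. As a minimal sketch, assuming the common choice of per-position Shannon entropy averaged over an ungapped set of equal-length sequences, the quantity could be computed as follows. The function names `column_entropy` and `mean_sequence_entropy` are hypothetical and are not taken from the paper, whose exact definition may differ.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues observed at one position."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mean_sequence_entropy(sequences):
    """Average per-position entropy over a set of equal-length sequences.

    Assumption: sequences are already aligned and ungapped, as designed
    sequences for a single fixed-length backbone would be.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*sequences) yields one tuple of residues per alignment position.
    return sum(column_entropy(col) for col in zip(*sequences)) / length


if __name__ == "__main__":
    # Toy input only; these are not sequences from the study.
    designed = ["MKVLA", "MRVIA", "MKILG", "MRVLA"]
    print(mean_sequence_entropy(designed))
```

Under this definition a fully conserved position contributes zero and a position with all twenty amino acids equally represented contributes log2(20), about 4.32 bits, so comparing the averages for designed and natural alignments of the same protein gives one way to express the diversity comparison described above.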