Efficient algorithms for protein sequence design and the analysis of certain evolutionary fitness landscapes

Protein sequence design is a natural inverse problem to protein structure prediction: given a target structure in three dimensions, we wish to design an amino acid sequence that is likely to fold to it. A model of Sun, Brem, Chan, and Dill casts this problem as an optimization on a space of sequences of hydrophobic (H) and polar (P) monomers; the goal is to find a sequence that achieves a dense hydrophobic core with few solvent-exposed hydrophobic residues. Sun et al. developed a heuristic method to search the space of sequences, without a guarantee of optimality or near-optimality; Hart subsequently raised the computational tractability of constructing an optimal sequence in this model as an open question. Here we resolve this question by providing an efficient algorithm to construct optimal sequences; our algorithm has a polynomial running time and performs very efficiently in practice. We illustrate the implementation of our method on structures drawn from the Protein Data Bank. We also consider extensions of the model to larger amino acid alphabets, as a way to overcome the limitations of the binary H/P alphabet. We show that for a natural class of arbitrarily large alphabets, it remains possible to design optimal sequences efficiently. Finally, we analyze some of the consequences of this sequence design model for the study of evolutionary fitness landscapes. A given target structure may have many sequences that are optimal in the model of Sun et al.; following a notion raised by the work of J. Maynard Smith, we can ask whether these optimal sequences are "connected" by successive point mutations. We provide a polynomial-time algorithm to decide this connectedness property, relative to a given target structure. We develop the algorithm by first solving an analogous problem expressed in terms of submodular functions, a fundamental object of study in combinatorial optimization.
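
To make the connectedness notion concrete, the following is a minimal sketch that checks, by brute force, whether an explicitly listed set of H/P sequences is connected under single point mutations (two sequences are adjacent if they differ at exactly one position). It is not the polynomial-time algorithm described above, which must cope with a set of optimal sequences that may be exponentially large; the function name `is_mutation_connected` and the string encoding of sequences are assumptions made for illustration only.

```python
from collections import deque


def is_mutation_connected(sequences):
    """Return True if every sequence in the set can be reached from every
    other by single H<->P point mutations that stay inside the set.

    Hypothetical helper for illustration: brute-force breadth-first search
    over an explicit set of equal-length strings on the alphabet {H, P}.
    """
    seq_set = set(sequences)
    if not seq_set:
        return True  # an empty set is treated as trivially connected

    start = next(iter(seq_set))
    seen = {start}
    queue = deque([start])

    while queue:
        current = queue.popleft()
        # Try flipping each position; keep neighbors that are in the set.
        for i, residue in enumerate(current):
            flipped = "P" if residue == "H" else "H"
            neighbor = current[:i] + flipped + current[i + 1:]
            if neighbor in seq_set and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)

    return len(seen) == len(seq_set)
```

For example, the set {"HHP", "HPP", "PPP"} is connected because each consecutive pair differs in one position, whereas {"HHP", "PPH"} is not, since no single mutation leads from one sequence to the other.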
