Algorithms for Joint Optimization of Stability and Diversity in Planning Combinatorial Libraries of Chimeric Proteins

Engineering protein variants by constructing and screening combinatorial libraries of chimeric proteins involves two complementary but competing goals: the new proteins must be similar enough to the evolutionarily selected wild-type proteins to fold stably, yet different enough to display functional variation. We present Staversity, the first method to simultaneously optimize stability and diversity in selecting sets of breakpoint locations for site-directed recombination. Our goal is to uncover all "undominated" breakpoint sets, those for which no other breakpoint set is better in both factors. Our first algorithm finds the undominated sets that serve as the vertices of the lower envelope of the two-dimensional (stability and diversity) convex hull containing all possible breakpoint sets. Our second algorithm identifies additional breakpoint sets in the concavities of that envelope; each is either undominated or dominated only by undiscovered breakpoint sets lying within a distance bound computed by the algorithm. Both algorithms are efficient, requiring only time polynomial in the numbers of residues and breakpoints while characterizing a space defined by an exponential number of possible breakpoint sets. We applied Staversity to identify sets of 2-10 breakpoints for three different sets of parent proteins from the purE family of biosynthetic enzymes. The average normalized distance between our plans and the lower bound for optimal plans is around 1 percent. Our plans dominate most (60-90% on average for each parent set) of the plans found by other possible approaches, namely random sampling or explicit optimization for stability with only implicit optimization for diversity. The identified breakpoint sets provide a compact representation of good plans, enabling a protein engineer to understand and account for the trade-offs between two key considerations in combinatorial chimeragenesis.
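The two geometric notions used above, undominated plans and the vertices of the lower envelope of the convex hull, can be illustrated on a small explicit list of points. The sketch below is not the Staversity algorithm: it enumerates candidate plans directly, whereas the method described here avoids enumerating the exponential space. It assumes each plan has already been scored as a pair (stability cost, diversity cost) in which lower is better for both; that convention, the function names, and the sample scores are all hypothetical.

```python
from typing import List, Tuple

# Assumption: each plan is scored as (stability_cost, diversity_cost),
# and lower is better in both coordinates.
Point = Tuple[float, float]


def undominated(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    Convention used here: p dominates q if p is at least as good in both
    coordinates and strictly better in at least one.
    """
    front: List[Point] = []
    best_y = float("inf")
    # After sorting by x (then y), a point survives only if it improves
    # on the best y seen so far.
    for x, y in sorted(set(points)):
        if y < best_y:
            front.append((x, y))
            best_y = y
    return front


def cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_envelope(points: List[Point]) -> List[Point]:
    """Return the undominated points that are vertices of the lower convex hull."""
    hull: List[Point] = []
    for p in undominated(points):
        # Drop the previous vertex while it lies on or above the segment
        # from its predecessor to p (monotone-chain lower hull).
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


if __name__ == "__main__":
    # Hypothetical scores for seven candidate breakpoint sets.
    plans = [(0, 10), (1, 6), (2, 5), (3, 1), (5, 0), (4, 4), (2, 8)]
    print(undominated(plans))
    print(lower_envelope(plans))
```

In this toy data the plan scored (2, 5) is undominated but lies in a concavity, so it is not a hull vertex. Plans of that kind are what the second algorithm is described as targeting.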
