Navigating the protein fitness landscape with Gaussian processes

Knowing how protein sequence maps to function (the “fitness landscape”) is critical for understanding protein evolution as well as for engineering proteins with new and useful properties. We demonstrate that the protein fitness landscape can be inferred from experimental data, using Gaussian processes, a Bayesian learning technique. Gaussian process landscapes can model various protein sequence properties, including functional status, thermostability, enzyme activity, and ligand binding affinity. Trained on experimental data, these models achieve unrivaled quantitative accuracy. Furthermore, the explicit representation of model uncertainty allows for efficient searches through the vast space of possible sequences. We develop and test two protein sequence design algorithms motivated by Bayesian decision theory. The first one identifies small sets of sequences that are informative about the landscape; the second one identifies optimized sequences by iteratively improving the Gaussian process model in regions of the landscape that are predicted to be optimized. We demonstrate the ability of Gaussian processes to guide the search through protein sequence space by designing, constructing, and testing chimeric cytochrome P450s. These algorithms allowed us to engineer active P450 enzymes that are more thermostable than any previously made by chimeragenesis, rational design, or directed evolution.
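The second design strategy described above, iteratively refining a Gaussian process model where it predicts high fitness, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' implementation: chimeras are encoded as one parent index per block, the kernel is a simple linear kernel on one-hot features (it counts shared blocks), the landscape is a synthetic additive one, and the next sequence is chosen by an upper-confidence-bound rule with a hypothetical weight `beta`. The abstract does not specify the kernel or the selection rule.

```python
import itertools
import numpy as np

def one_hot(seqs, n_parents):
    """Encode chimeras (one parent index per block) as flat one-hot vectors."""
    seqs = np.asarray(seqs)
    return np.eye(n_parents)[seqs].reshape(len(seqs), -1)

def kernel(A, B, signal_var=1.0):
    """Assumed kernel: linear on one-hot features, i.e. the number of shared blocks."""
    return signal_var * A @ B.T

def gp_posterior(X_train, y_train, X_test, noise_var=0.1):
    """Standard GP regression: posterior mean and variance of the latent function."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_test, X_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(kernel(X_test, X_test)) - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

def demo(n_rounds=10, beta=2.0, noise_sd=0.1, seed=0):
    """Sequentially 'measure' sequences chosen by an upper-confidence-bound rule."""
    rng = np.random.default_rng(seed)
    n_parents, n_blocks = 3, 4
    seqs = np.array(list(itertools.product(range(n_parents), repeat=n_blocks)))
    X = one_hot(seqs, n_parents)
    # Hypothetical ground truth: each (block, parent) pair adds a random contribution.
    true_fitness = X @ rng.normal(size=X.shape[1])
    # Start from a few randomly chosen, noisily measured sequences.
    obs = {int(i): true_fitness[i] + rng.normal(scale=noise_sd)
           for i in rng.choice(len(X), size=5, replace=False)}
    for _ in range(n_rounds):
        idx = list(obs)
        y = np.array([obs[i] for i in idx])
        mean, var = gp_posterior(X[idx], y, X, noise_var=noise_sd ** 2)
        score = mean + beta * np.sqrt(var)   # favor high mean or high uncertainty
        score[idx] = -np.inf                 # do not re-measure a sequence
        nxt = int(np.argmax(score))
        obs[nxt] = true_fitness[nxt] + rng.normal(scale=noise_sd)
    return max(obs.values()), true_fitness.max()

if __name__ == "__main__":
    best_measured, true_optimum = demo()
    print(f"best measured: {best_measured:.2f}, true optimum: {true_optimum:.2f}")
```

The posterior variance is what separates this from a plain regression model: sequences far from anything measured keep a large variance, so the selection rule can trade off exploring them against exploiting sequences already predicted to be good.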
