Machine-learning-guided directed evolution for protein engineering

Protein engineering through machine-learning-guided directed evolution enables the optimization of protein functions. Machine-learning approaches predict how sequence maps to function in a data-driven manner without requiring a detailed model of the underlying physics or biological pathways. Such methods accelerate directed evolution by learning from the properties of characterized variants and using that information to select sequences that are likely to exhibit improved properties. Here we introduce the steps required to build machine-learning sequence–function models and to use those models to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to the use of machine learning for protein engineering, as well as the current literature and applications of this engineering paradigm. We illustrate the process with two case studies. Finally, we look to future opportunities for machine learning to enable the discovery of unknown protein functions and uncover the relationship between protein sequence and function.
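As a rough illustration of the loop described above (learn from characterized variants, then select sequences predicted to be improved), the following is a minimal sketch and not the method of any particular study. It assumes a one-hot sequence encoding and a ridge regression fitted to deviations from the mean measured property. The four-residue sequences, the property values, and the names `one_hot`, `solve`, and `fit_ridge` are all hypothetical, invented for this example only.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector, one slot per (position, residue)."""
    vec = [0.0] * (len(seq) * len(ALPHABET))
    for pos, aa in enumerate(seq):
        vec[pos * len(ALPHABET) + ALPHABET.index(aa)] = 1.0
    return vec

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            if f:
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w

def fit_ridge(seqs, ys, l2=1e-3):
    """Fit ridge regression on one-hot features; return a predictor function."""
    X = [one_hot(s) for s in seqs]
    mean_y = sum(ys) / len(ys)
    d = len(X[0])
    # Normal equations (X^T X + l2 * I) w = X^T (y - mean_y)
    A = [[sum(x[i] * x[j] for x in X) + (l2 if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(x[i] * (y - mean_y) for x, y in zip(X, ys)) for i in range(d)]
    w = solve(A, b)
    return lambda seq: mean_y + sum(wi * xi for wi, xi in zip(w, one_hot(seq)))

# Hypothetical characterized variants: sequence -> measured property (made up).
measured = {
    "MKLV": 1.0,  # parent
    "MRLV": 1.4,
    "MKIV": 1.3,
    "MKLA": 0.6,
    "GKLV": 0.9,
    "MRLA": 1.0,
}

model = fit_ridge(list(measured), list(measured.values()))

# Candidate library: every combination of residues already seen at each site,
# excluding variants that have been measured.
site_options = [sorted({s[i] for s in measured}) for i in range(4)]
candidates = ["".join(c) for c in product(*site_options)
              if "".join(c) not in measured]

# Rank untested candidates by predicted property, best first.
ranked = sorted(candidates, key=model, reverse=True)
for seq in ranked[:3]:
    print(seq, round(model(seq), 2))
```

In a real campaign the top-ranked sequences would be synthesized and assayed, and their measurements added to `measured` before refitting. A linear model of this kind captures only additive effects; it cannot represent epistasis between sites.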
