Machine learning modeling of family-wide enzyme-substrate specificity screens

arXiv:2109.03900v1 [q-bio.BM] 8 Sep 2021

Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well suited to this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate six high-quality enzyme family screens from the literature, each of which measures multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature that are used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single-task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family-level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction-based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.
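The comparison between an interaction-style model and a collection of independent single-task models can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names, the use of closed-form ridge regression, and the stand-in "CPI" model, which is simply one regressor over concatenated enzyme and substrate feature vectors rather than any of the published architectures.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression (the bias is regularized too, for brevity)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict_ridge(w, X):
    """Apply ridge weights (last entry is the bias) to new inputs."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def single_task_baseline(enzyme_feats, activity, train_idx, test_idx, alpha=1.0):
    """Hypothetical no-interaction baseline: one enzyme-only model per substrate."""
    n_sub = activity.shape[1]
    preds = np.zeros((len(test_idx), n_sub))
    for j in range(n_sub):
        # Substrate j gets its own model; substrate features are never used.
        w = fit_ridge(enzyme_feats[train_idx], activity[train_idx, j], alpha)
        preds[:, j] = predict_ridge(w, enzyme_feats[test_idx])
    return preds

def concat_cpi_model(enzyme_feats, substrate_feats, activity,
                     train_idx, test_idx, alpha=1.0):
    """Hypothetical stand-in for a CPI model: one joint model over pairs."""
    n_sub = activity.shape[1]

    def pair_features(idx):
        # Row order is enzyme-major, matching activity[idx].reshape(-1).
        return np.array([
            np.concatenate([enzyme_feats[i], substrate_feats[j]])
            for i in idx for j in range(n_sub)
        ])

    w = fit_ridge(pair_features(train_idx), activity[train_idx].reshape(-1), alpha)
    return predict_ridge(w, pair_features(test_idx)).reshape(len(test_idx), n_sub)
```

Both functions return an array of shape (held-out enzymes, substrates), so they can be scored with the same metric on the same enzyme split. The design point of such a comparison is that a joint model only earns the "interaction" label if it beats the per-substrate models on held-out enzymes.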
