Deep neural language modeling enables functional protein generation across families

Bypassing nature’s evolutionary trajectory, de novo protein generation (defined as creating artificial protein sequences from scratch) could enable breakthrough solutions for biomedical and environmental challenges. Viewing amino acid sequences as a language, we demonstrate that a deep learning-based language model can generate functional artificial protein sequences across families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. Our protein language model is trained simply by learning to predict the next amino acid for over 280 million protein sequences from thousands of protein families, without biophysical or coevolutionary modeling. We experimentally evaluate model-generated artificial proteins on five distinct antibacterial lysozyme families. The artificial proteins show activities and catalytic efficiencies similar to those of representative natural lysozymes, including hen egg white lysozyme, while reaching as low as 44% sequence identity to any known naturally evolved protein. The X-ray crystal structure of an enzymatically active artificial protein recapitulates the conserved fold and the positioning of active site residues found in natural proteins. We demonstrate that our language model can be adapted to different protein families by accurately predicting the functionality of artificial chorismate mutase and malate dehydrogenase proteins. These results indicate that neural language models can successfully perform de novo protein generation across protein families and may prove to be a tool to shortcut evolution.
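The training objective described above, predicting the next amino acid given the preceding ones, can be illustrated with a minimal sketch. This is not the model used in the work: a smoothed bigram table stands in for the neural network, and the `BOS`/`EOS` symbols, the function names, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BOS = "<"  # hypothetical start-of-sequence symbol (assumption)
EOS = ">"  # hypothetical end-of-sequence symbol (assumption)
VOCAB = list(AMINO_ACIDS) + [EOS]  # every symbol the model may predict


def next_token_pairs(seq):
    """Return (context, target) pairs: each residue is predicted from its prefix."""
    tokens = [BOS] + list(seq) + [EOS]
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def fit_bigram(seqs, alpha=1.0):
    """Toy stand-in for a neural language model.

    Estimates P(next | previous residue) with add-alpha smoothing. A real
    model would condition on the whole prefix, not just the last residue.
    """
    counts = defaultdict(Counter)
    for seq in seqs:
        for context, target in next_token_pairs(seq):
            counts[context[-1]][target] += 1

    def prob(prev, target):
        total = sum(counts[prev].values()) + alpha * len(VOCAB)
        return (counts[prev][target] + alpha) / total

    return prob


def nll_per_token(seq, prob):
    """Average negative log-likelihood, the quantity next-token training minimizes."""
    pairs = next_token_pairs(seq)
    return -sum(math.log(prob(ctx[-1], tgt)) for ctx, tgt in pairs) / len(pairs)


if __name__ == "__main__":
    model = fit_bigram(["MKV", "MKL"])  # toy sequences, not real proteins
    print(nll_per_token("MKV", model), nll_per_token("WWW", model))
```

A sequence resembling the training data receives a lower average negative log-likelihood than an unrelated one. The same quantity, computed by a much larger network over the full prefix, is what next-amino-acid training drives down.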
