ProGen: Language Modeling for Protein Generation

Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and materials science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly structural annotations. We train a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control, as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.
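The conditioning scheme described above amounts to prepending control tags (taxonomy, keyword) to the amino-acid sequence and training a standard autoregressive language model over the combined token stream. The sketch below illustrates that idea; it is not the authors' implementation, and the tag names, vocabulary, and tiny Transformer (with positional encodings omitted for brevity) are illustrative placeholders.

```python
# Minimal sketch of tag-conditioned autoregressive protein modeling.
# All tags, sizes, and hyperparameters are hypothetical, for illustration only.
import torch
import torch.nn as nn

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
TAGS = ["<taxonomy:primates>", "<keyword:hydrolase>"]  # hypothetical tags
PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"
vocab = [PAD, BOS, EOS] + TAGS + AMINO_ACIDS
stoi = {tok: i for i, tok in enumerate(vocab)}

def encode(tags, sequence):
    """Prepend conditioning tags to the residue sequence, then map to ids."""
    tokens = [BOS] + tags + list(sequence) + [EOS]
    return torch.tensor([stoi[t] for t in tokens])

class TinyCausalLM(nn.Module):
    """A toy causal Transformer (positional encodings omitted for brevity)."""
    def __init__(self, vocab_size, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.encoder(self.embed(x), mask=mask)
        return self.head(h)

# One training step: shift-by-one next-token prediction over tags + residues.
model = TinyCausalLM(len(vocab))
ids = encode(TAGS, "MKTAYIAKQR").unsqueeze(0)  # (1, seq_len)
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
loss.backward()
```

Because the tags are ordinary vocabulary tokens, generation can be steered by seeding the decoder with a chosen tag prefix and sampling residues from there, which is what enables the fine-grained control the abstract describes.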
