ProT-VAE: Protein Transformer Variational AutoEncoder for Functional Protein Design

The data-driven design of protein sequences with desired function is challenged by the absence of good theoretical models for the sequence-function mapping and by the vast size of protein sequence space. Deep generative models have demonstrated success in learning the sequence-function relationship from natural training data and in sampling from the learned distribution to design synthetic sequences with engineered functionality. We introduce a deep generative model, the Protein Transformer Variational AutoEncoder (ProT-VAE), that furnishes an accurate, generative, fast, and transferable model of the sequence-function relationship for data-driven protein engineering. It blends the merits of variational autoencoders, which learn interpretable, low-dimensional latent embeddings and support fully generative decoding for conditional sequence design, with the expressive, alignment-free featurization offered by transformers. The model sandwiches a lightweight, task-specific variational autoencoder between generic, pre-trained transformer encoder and decoder stacks. This design admits alignment-free training in an unsupervised or semi-supervised fashion and yields interpretable, low-dimensional latent spaces that facilitate understanding, optimization, and generative design of functional synthetic sequences. We implement the model using NVIDIA’s BioNeMo framework and validate its performance in retrospective functional prediction and in prospective design of novel protein sequences that were synthesized and tested experimentally. The ProT-VAE latent space exposes ancestral and functional relationships that enable conditional generation of novel sequences with high functionality and substantial sequence diversity. We anticipate that the model can offer an extensible and generic platform for machine learning-guided directed evolution campaigns for the data-driven design of novel synthetic proteins with “super-natural” function.
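
As a rough illustration of the variational bottleneck that sits between the transformer stacks, the sketch below maps a fixed-length hidden vector to a latent mean and log-variance, samples a latent code by reparameterization, and computes a KL regularizer. Everything here is an assumption made for illustration: the function names, the use of plain linear maps, the toy dimensions, and the standard Gaussian KL term are not taken from the abstract, which does not state the loss or layer design, and the pre-trained transformer encoder is represented only by a hand-written stand-in vector.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row (hypothetical stand-in for a learned map)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions.

    Assumption: a standard VAE regularizer; the abstract does not specify the loss.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def bottleneck(hidden, w_mu, b_mu, w_lv, b_lv, rng):
    """Compress a hidden vector into a sampled latent code and its KL penalty."""
    mu = linear(hidden, w_mu, b_mu)
    log_var = linear(hidden, w_lv, b_lv)
    return reparameterize(mu, log_var, rng), kl_to_standard_normal(mu, log_var)


rng = random.Random(0)
# Stand-in for a pooled output of a frozen, pre-trained transformer encoder.
hidden = [0.5, -1.0, 2.0, 0.0]
# Toy weights mapping 4 hidden features to a 2-dimensional latent space.
w_mu = [[rng.uniform(-0.1, 0.1) for _ in hidden] for _ in range(2)]
w_lv = [[rng.uniform(-0.1, 0.1) for _ in hidden] for _ in range(2)]
z, kl = bottleneck(hidden, w_mu, [0.0, 0.0], w_lv, [0.0, 0.0], rng)
print(len(z), kl >= 0.0)
```

In a full model the latent code `z` would be expanded back to the hidden size and passed to the transformer decoder stack; only the small bottleneck would need task-specific training under this reading of the design.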
