Accelerating Protein Design Using Autoregressive Generative Models

A major biomedical challenge is interpreting genetic variation and designing novel functional sequences. Because the space of all possible genetic variation is enormous, there is a concerted effort to develop reliable methods that capture genotype-to-phenotype maps. State-of-the-art computational methods rely on models that leverage evolutionary information and capture complex interactions between residues. However, current methods are unsuitable for many important applications because they depend on robust protein or RNA alignments. Such applications include genetic variants with insertions and deletions, disordered proteins, and functional antibodies. Ideally, we need models that do not rely on the assumptions made by multiple sequence alignments. Here we borrow from recent advances in natural language processing and speech synthesis to develop a generative, deep neural network-powered autoregressive model for biological sequences that captures functional constraints without relying on an explicit alignment structure. Applied to unseen experimental measurements from 42 deep mutational scans, the model predicts the effects of insertions and deletions while matching state-of-the-art accuracy for missense mutation prediction. We then test the model on single-domain antibodies, or nanobodies, a complex target for alignment-based models because of their highly variable complementarity-determining regions. We fit the model to a naïve llama immune repertoire and generate a diverse, optimized library of 10^5 nanobody sequences for experimental validation. Our results demonstrate the power of the 'alignment-free' autoregressive model in mutation effect prediction and in the design of traditionally challenging sequence families.
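
To illustrate why an autoregressive model needs no alignment, the following minimal sketch scores a variant by the difference in log-likelihood between the variant and the wild type, where each log-likelihood is a sum of next-residue log-probabilities. It is an illustration under stated assumptions, not the authors' implementation: a smoothed first-order (bigram) model stands in for the deep neural network, and the names `fit_bigram`, `log_prob`, and `mutation_effect` are hypothetical. Because the score is a sum over however many residues a sequence has, insertions and deletions are scored the same way as substitutions.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
START, END = "^", "$"              # hypothetical boundary symbols


def fit_bigram(seqs, pseudocount=1.0):
    """Count residue-to-residue transitions in unaligned sequences.

    Stand-in for training a neural autoregressive model: each position is
    conditioned only on the previous residue, not on the full prefix.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        prev = START
        for ch in s + END:
            counts[prev][ch] += 1.0
            prev = ch
    return counts, pseudocount


def log_prob(model, seq):
    """Autoregressive log-likelihood: sum_i log p(x_i | x_{i-1})."""
    counts, pc = model
    n_symbols = len(ALPHABET) + 1  # amino acids plus END
    total = 0.0
    prev = START
    for ch in seq + END:
        row = counts.get(prev, {})
        denom = sum(row.values()) + pc * n_symbols
        total += math.log((row.get(ch, 0.0) + pc) / denom)
        prev = ch
    return total


def mutation_effect(model, wild_type, variant):
    """Log-likelihood ratio; sequences may differ in length."""
    return log_prob(model, variant) - log_prob(model, wild_type)


# Toy family of unaligned sequences (made-up data for illustration only).
family = ["MKVLA", "MKVLA", "MKILA"]
model = fit_bigram(family)

substitution = mutation_effect(model, "MKVLA", "MKILA")
deletion = mutation_effect(model, "MKVLA", "MKLA")  # one residue removed
```

A more negative score means the variant is less likely under the fitted model than the wild type. Replacing the bigram table with a network that conditions on the entire preceding sequence would leave the scoring function unchanged.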
