Variational auto-encoding of protein sequences

Proteins carry out the most diverse set of functions in biology. The ability to extract information from protein sequences and to predict the effects of mutations is valuable in many domains of biology and medicine. However, the mapping between protein sequence and function is complex and poorly understood. Here we present an embedding of natural protein sequences using a variational auto-encoder (VAE) and use it to predict how mutations affect protein function. We use this unsupervised approach to cluster natural variants and to learn interactions between sets of positions within a protein. The approach generally performs better than baseline methods that consider no interactions within sequences, and in some cases better than state-of-the-art approaches that use the inverse Potts model. The generative model can be used to computationally guide exploration of protein sequence space and to better inform rational and automatic protein design.
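
To make the idea concrete, the following is a minimal sketch of how a VAE can embed a sequence and score a mutation. It is not the implementation described here: the alphabet, the `TinyVAE` class, its random untrained weights, the single-layer encoder and decoder, and the scoring rule (difference in evidence lower bound, ELBO, between mutant and wild type) are all assumptions chosen for illustration. The reconstruction term is evaluated at the posterior mean rather than by sampling, which is a further simplification.

```python
import numpy as np

# Assumed alphabet: 20 standard residues plus an alignment gap character.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode an aligned sequence as a (length, alphabet) one-hot matrix."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[a] for a in seq]] = 1.0
    return x


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


class TinyVAE:
    """Hypothetical, untrained linear VAE with random weights (illustration only)."""

    def __init__(self, seq_len, latent_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        d = seq_len * len(AMINO_ACIDS)
        self.seq_len = seq_len
        self.W_mu = rng.normal(0.0, 0.1, (d, latent_dim))
        self.W_logvar = rng.normal(0.0, 0.1, (d, latent_dim))
        self.W_dec = rng.normal(0.0, 0.1, (latent_dim, d))

    def encode(self, x):
        """Return the mean and log-variance of the approximate posterior q(z|x)."""
        flat = x.reshape(-1)
        return flat @ self.W_mu, flat @ self.W_logvar

    def elbo(self, x):
        """Approximate ELBO: reconstruction log-likelihood at z = mu, minus KL."""
        mu, logvar = self.encode(x)
        logits = (mu @ self.W_dec).reshape(self.seq_len, -1)
        recon = (x * log_softmax(logits)).sum()
        # KL divergence between N(mu, sigma^2) and the standard normal prior.
        kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
        return recon - kl


def mutation_score(model, wild_type, mutant):
    """Assumed scoring rule: ELBO(mutant) - ELBO(wild type); higher is more favourable."""
    return model.elbo(one_hot(mutant)) - model.elbo(one_hot(wild_type))


if __name__ == "__main__":
    wild_type = "MKTAYIAK"   # toy sequence, not real data
    mutant = "MKTAYIAR"      # single substitution K8R
    model = TinyVAE(seq_len=len(wild_type))
    embedding, _ = model.encode(one_hot(wild_type))
    print("latent embedding:", embedding)
    print("mutation score:", mutation_score(model, wild_type, mutant))
```

With trained weights, the posterior means returned by `encode` would serve as the embedding used to cluster natural variants, and because the decoder conditions every position on a shared latent vector, the score can reflect dependencies between positions rather than treating each site independently.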
