Deep generative models of genetic variation capture mutation effects

The functions of proteins and RNAs are determined by myriad interactions between their constituent residues, but most quantitative models of how molecular phenotype depends on genotype must approximate these interactions with simple additive effects. While recent models have relaxed this constraint to also account for pairwise interactions, these approaches do not provide a tractable path towards modeling higher-order epistasis. Here, we show how latent variable models with nonlinear dependencies can be applied to capture beyond-pairwise constraints in biomolecules. We present a new probabilistic model for sequence families, DeepSequence, that predicts the effects of mutations across a variety of deep mutational scanning experiments significantly better than site-independent or pairwise models based on the same evolutionary data. The model, learned in an unsupervised manner solely from sequence information, is grounded with biologically motivated priors, reveals latent organization of sequence families, and can be used to extrapolate to new parts of sequence space.
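
As a rough illustration of the three model classes contrasted above, the block below writes each one as a distribution over an aligned sequence x = (x_1, ..., x_L). The symbols h_i, J_ij, z, and theta, and the log-ratio score in the last line, are notational assumptions made for this sketch; they are not stated in the abstract, and the exact parameterization in the full paper may differ.

```latex
% Sketch only; requires amsmath. All symbols are illustrative assumptions.
\begin{align}
  % Site-independent (additive) model: each position contributes separately.
  p_{\mathrm{indep}}(x) &= \frac{1}{Z_{\mathrm{indep}}}
      \exp\Big( \sum_{i=1}^{L} h_i(x_i) \Big) \\
  % Pairwise model: adds a coupling term for every pair of positions.
  p_{\mathrm{pair}}(x) &= \frac{1}{Z_{\mathrm{pair}}}
      \exp\Big( \sum_{i=1}^{L} h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j) \Big) \\
  % Latent variable model: positions are conditionally dependent on a shared z,
  % and p(x | z, theta) may be a nonlinear function of z.
  p_{\mathrm{latent}}(x \mid \theta) &= \int p(x \mid z, \theta)\, p(z)\, dz \\
  % One common way to score a mutant against the wild type under any model p.
  \Delta(x^{\mathrm{mut}}) &= \log \frac{p(x^{\mathrm{mut}})}{p(x^{\mathrm{wt}})}
\end{align}
```

Because z is shared by every position, a nonlinear p(x | z, theta) can couple many residues at once, which is one way to read "beyond-pairwise" in this context.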
