Mutation effects predicted from sequence co-variation

Many high-throughput experimental technologies have been developed to assess the effects of large numbers of mutations (variation) on phenotypes. However, designing functional assays for these methods is challenging, and systematic testing of all combinations is impossible, so robust methods to predict the effects of genetic variation are needed. Most prediction methods exploit evolutionary sequence conservation but do not consider the interdependencies of residues or bases. We present EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions. We validate EVmutation by comparing its predictions with outcomes of high-throughput mutagenesis experiments and measurements of human disease mutations and show that it outperforms methods that do not account for epistasis. EVmutation can be used to assess the quantitative effects of mutations in genes of any organism. We provide pre-computed predictions for ∼7,000 human proteins at http://evmutation.org/.
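The abstract does not give the scoring function, so the following is only a minimal sketch of what capturing dependencies between positions can look like. It assumes a pairwise model in which a sequence is scored by per-site terms plus pairwise coupling terms, and a mutation's effect is the score difference between mutant and wild type. The function names, the two-letter alphabet, and all parameter values are hypothetical illustrations, not EVmutation's code or fitted parameters.

```python
from itertools import combinations

def sequence_score(seq, site_terms, couplings):
    """Score = sum of per-site terms + sum of pairwise coupling terms.

    site_terms: list of dicts, site_terms[i][letter] -> float (assumed form)
    couplings:  dict mapping (i, j) with i < j to a dict
                {(letter_i, letter_j): float}; missing entries count as 0.
    """
    site = sum(site_terms[i].get(a, 0.0) for i, a in enumerate(seq))
    pair = sum(
        couplings.get((i, j), {}).get((seq[i], seq[j]), 0.0)
        for i, j in combinations(range(len(seq)), 2)
    )
    return site + pair

def mutation_effect(wild_type, mutations, site_terms, couplings):
    """Score difference (mutant minus wild type) for a list of (position, letter) substitutions."""
    mutant = list(wild_type)
    for pos, letter in mutations:
        mutant[pos] = letter
    return (sequence_score(mutant, site_terms, couplings)
            - sequence_score(wild_type, site_terms, couplings))

# Toy parameters (hypothetical): three positions over the alphabet {A, B}.
# Positions 0 and 1 are coupled and favour carrying the same letter.
SITE_TERMS = [{"A": 0.5, "B": 0.0}, {"A": 0.2, "B": 0.0}, {"A": 0.0, "B": 0.0}]
COUPLINGS = {(0, 1): {("A", "A"): 1.0, ("B", "B"): 1.0}}
WILD_TYPE = "AAA"

single_0 = mutation_effect(WILD_TYPE, [(0, "B")], SITE_TERMS, COUPLINGS)
single_1 = mutation_effect(WILD_TYPE, [(1, "B")], SITE_TERMS, COUPLINGS)
double = mutation_effect(WILD_TYPE, [(0, "B"), (1, "B")], SITE_TERMS, COUPLINGS)

print("single mutant at position 0:", round(single_0, 3))
print("single mutant at position 1:", round(single_1, 3))
print("sum of single effects:      ", round(single_0 + single_1, 3))
print("double mutant (with coupling):", round(double, 3))
```

In this toy setting the double mutant scores less negatively than the sum of the two single mutants, because the second substitution restores the favoured pairing. A model that scores each position independently cannot represent that kind of context dependence.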
