Detecting Amino Acid Coevolution with Bayesian Graphical Models

The comparative study of homologous proteins can provide abundant information about the functional and structural constraints on protein evolution. For example, an amino acid substitution that is deleterious on its own may be tolerated in the presence of another substitution at a second site of the protein, so that the two sites coevolve. A popular approach for detecting coevolving residues is to look for correlated substitution events on branches of the molecular phylogeny relating the protein-coding sequences. Here we describe a machine learning method (Bayesian graphical models), implemented in the open-source phylogenetic software package HyPhy (http://hyphy.org), for extracting a network of coevolving residues from a sequence alignment.
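The idea of correlated substitution events on branches can be illustrated with a small pairwise sketch. It is not the Bayesian graphical model described here, nor HyPhy's implementation: it swaps in a one-sided Fisher's exact test on a 2x2 table of branches, which looks at one pair of sites at a time rather than a network. The input format (a dictionary mapping each branch to the set of sites substituted on it), the branch labels, the site numbers, and the function names are all hypothetical and chosen only for illustration.

```python
from math import comb

def contingency(branch_subs, site_a, site_b):
    """Count branches by whether site_a and/or site_b carry a substitution."""
    both = a_only = b_only = neither = 0
    for sites in branch_subs.values():
        has_a, has_b = site_a in sites, site_b in sites
        if has_a and has_b:
            both += 1
        elif has_a:
            a_only += 1
        elif has_b:
            b_only += 1
        else:
            neither += 1
    return both, a_only, b_only, neither

def fisher_one_sided(both, a_only, b_only, neither):
    """Probability of at least `both` co-occurrences if the two sites
    were substituted independently (upper hypergeometric tail)."""
    n = both + a_only + b_only + neither
    row_a = both + a_only          # branches with a substitution at site_a
    col_b = both + b_only          # branches with a substitution at site_b
    denom = comb(n, col_b)
    max_k = min(row_a, col_b)
    tail = sum(comb(row_a, k) * comb(n - row_a, col_b - k)
               for k in range(both, max_k + 1))
    return tail / denom

# Hypothetical toy data: eight branches, substitutions mapped to site indices.
branch_subs = {
    "b1": {10, 42}, "b2": {10, 42}, "b3": {10, 42},
    "b4": set(), "b5": {7}, "b6": set(), "b7": {7}, "b8": set(),
}

table = contingency(branch_subs, 10, 42)
p_value = fisher_one_sided(*table)
print(table, p_value)
```

In this toy data, sites 10 and 42 change together on three of the eight branches and never apart, so the tail probability is small. A pairwise test of this kind cannot separate direct from indirect associations among sites, which is one motivation for modelling all sites jointly as a network.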
