Probabilistic Graphical Model Representation in Phylogenetics

Recent years have seen a rapid expansion of the model space explored in statistical phylogenetics, underscoring the need for new approaches to statistical model representation and software development. Clear communication and representation of the chosen model is crucial for: (i) reproducibility of an analysis, (ii) model development, and (iii) software design. Moreover, a unified, clear, and understandable framework for model representation lowers the barrier for beginners and nonspecialists to grasp complex phylogenetic models, including their assumptions and the dependencies among their parameters and variables. Graphical modeling is a unifying framework that has gained popularity in the statistical literature in recent years. The core idea is to break complex models into conditionally independent distributions. The strength of this formalism lies in its comprehensibility, flexibility, and adaptability, and in the large body of computational work based on it. Graphical models are well suited for teaching statistical models, for facilitating communication among phylogeneticists, and for developing generic software for simulation and statistical inference. Here, we provide an introduction to graphical models for phylogeneticists and extend the standard graphical model representation to the realm of phylogenetics. We introduce a new graphical model component, tree plates, to capture the changing structure of the subgraph corresponding to a phylogenetic tree. We describe a range of phylogenetic models using the graphical model framework and introduce modules to simplify the representation of standard components in large and complex models. Phylogenetic model graphs can be readily used in simulation, maximum likelihood inference, and Bayesian inference using, for example, Metropolis–Hastings or Gibbs sampling of the posterior distribution. [Computation; graphical models; inference; modularization; statistical phylogenetics; tree plate.]
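
The idea of breaking a model into conditionally independent distributions can be illustrated with a minimal sketch. In a directed graphical model, the joint density is the product of each node's density given its parents, so the joint log density is a sum over nodes. The three-node model below is a hypothetical toy example, not one taken from the paper: a rate with an Exponential(1) prior, a branch length `t` that is exponentially distributed given the rate, and a count `k` of differing sites between two sequences, assumed binomial with the Jukes-Cantor (JC69) difference probability. All names (`NODES`, `N_SITES`, `joint_log_density`) are illustrative assumptions.

```python
import math

N_SITES = 100  # assumed alignment length for this toy example


def log_exponential(x, rate):
    """Log density of an Exponential(rate) distribution at x > 0."""
    return math.log(rate) - rate * x


def jc69_p_diff(t):
    """JC69 probability that a site differs across a branch of length t."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def log_binomial(k, n, p):
    """Log probability of k successes in n trials with success probability p."""
    return (math.log(math.comb(n, k))
            + k * math.log(p)
            + (n - k) * math.log(1.0 - p))


# Each node maps to (parent names, conditional log density).
# The log density takes the node's own value and a dict of parent values.
NODES = {
    "rate": ((), lambda x, pa: log_exponential(x, 1.0)),
    "t": (("rate",), lambda x, pa: log_exponential(x, pa["rate"])),
    "k": (("t",), lambda x, pa: log_binomial(x, N_SITES, jc69_p_diff(pa["t"]))),
}


def joint_log_density(values):
    """Sum the conditional log densities of all nodes given their parents."""
    total = 0.0
    for name, (parents, log_density) in NODES.items():
        parent_values = {p: values[p] for p in parents}
        total += log_density(values[name], parent_values)
    return total


if __name__ == "__main__":
    print(joint_log_density({"rate": 2.0, "t": 0.1, "k": 5}))
```

Because each factor depends only on a node and its parents, a proposal that changes a single node (as in a Metropolis-Hastings update) only requires recomputing the factors for that node and its children; the sketch recomputes everything for simplicity.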
