State aggregation for fast likelihood computations in molecular evolution

Motivation: Codon models are widely used to identify the signature of selection at the molecular level and to test for changes in selective pressure during the evolution of protein-coding genes. The large state space of the Markov processes used to model codon evolution makes these models difficult to apply to large biological datasets. We propose using state aggregation to reduce the state space of codon models and thus improve the computational performance of likelihood estimation.

Results: We show that this heuristic speeds up computations for the M0 and branch-site models by a factor of up to 6.8. Simulations show that state aggregation does not introduce a detectable bias. On a real dataset, aggregation yields predictions that are highly correlated with those of the full likelihood computations. Finally, state aggregation is a very general approach that can be applied to any continuous-time Markov process-based model with a large state space, such as amino acid and coevolution models. We therefore discuss different ways to apply state aggregation to Markov models used in phylogenetics.

Availability and Implementation: The heuristic is implemented in the godon package (https://bitbucket.org/Davydov/godon) and in a version of FastCodeML (https://gitlab.isb-sib.ch/phylo/fastcodeml).

Contact: nicolas.salamin@unil.ch

Supplementary information: Supplementary data are available at Bioinformatics online.
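To make the idea of state aggregation concrete, the following is a minimal sketch of lumping the states of a continuous-time Markov rate matrix into groups. The abstract does not say how states are grouped or how the aggregated rates are defined, so the stationary-weighted averaging used here is an assumption (a standard lumping construction), not necessarily the scheme implemented in godon or FastCodeML. The function name `aggregate_rate_matrix` and the 4-state toy matrix are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def aggregate_rate_matrix(
    q: Sequence[Sequence[float]],
    pi: Sequence[float],
    groups: Sequence[Sequence[int]],
) -> List[List[float]]:
    """Lump the states of a rate matrix q into the given groups.

    Assumed scheme: the rate from group A to group B is the average,
    weighted by the stationary frequencies pi within A, of the total
    rate from each state i in A into the states of B.
    """
    k = len(groups)
    q_agg = [[0.0] * k for _ in range(k)]
    for a, group_a in enumerate(groups):
        pi_a = sum(pi[i] for i in group_a)  # total frequency of group A
        for b, group_b in enumerate(groups):
            if a == b:
                continue
            q_agg[a][b] = sum(
                pi[i] / pi_a * sum(q[i][j] for j in group_b)
                for i in group_a
            )
        # The diagonal makes each row sum to zero, as in any rate matrix.
        q_agg[a][a] = -sum(q_agg[a][b] for b in range(k) if b != a)
    return q_agg


# Toy 4-state process with equal rates and uniform stationary frequencies.
q = [
    [-3.0, 1.0, 1.0, 1.0],
    [1.0, -3.0, 1.0, 1.0],
    [1.0, 1.0, -3.0, 1.0],
    [1.0, 1.0, 1.0, -3.0],
]
pi = [0.25, 0.25, 0.25, 0.25]

# Keep states 0 and 1 as they are; merge states 2 and 3 into one state.
groups = [[0], [1], [2, 3]]
q_agg = aggregate_rate_matrix(q, pi, groups)

for row in q_agg:
    print(row)
```

The aggregated matrix has three states instead of four. In a likelihood computation the cost of exponentiating the rate matrix and of the per-branch matrix-vector products grows with the number of states, which is why shrinking the 61-state codon space is attractive. How to choose the groups so that the likelihood is well approximated is the substantive question, and this sketch does not address it.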
