Comparison of methods for calculating conditional expectations of sufficient statistics for continuous time Markov chains

Background: Continuous-time Markov chains (CTMCs) are a widely used model for describing the evolution of DNA sequences at the nucleotide, amino acid, or codon level. The sufficient statistics for CTMCs are the time spent in each state and the number of changes between any two states. In applications, past evolutionary events (the exact times and types of changes) are inaccessible, and the past must be inferred from DNA sequence data observed in the present.

Results: We describe and implement three algorithms for computing linear combinations of expected values of the sufficient statistics, conditioned on the end-points of the chain, and compare their performance with respect to accuracy and running time. The first algorithm is based on an eigenvalue decomposition of the rate matrix (EVD), the second on uniformization (UNI), and the third on integrals of matrix exponentials (EXPM). The implementation of the algorithms in R is available at http://www.birc.au.dk/~paula/.

Conclusions: We use two different models to analyze the accuracy and eight experiments to investigate the speed of the three algorithms. We find that they have similar accuracy and that EXPM is the slowest method. Furthermore, we find that UNI is usually faster than EVD.
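To make the uniformization idea concrete, the following is a minimal Python sketch of one such quantity: the expected time spent in state i, conditioned on the chain starting in state a and ending in state b after time T. It is not the authors' R implementation. The function names, the truncation rule (stop once the remaining Poisson mass falls below `tol`), and the two-state example are assumptions made for illustration. The sketch relies on the standard identity that, with mu = max_k(-Q_kk) and R = I + Q/mu, the integral of P_ai(t) P_ib(T-t) over [0, T] equals the sum over n >= 0 of (1/mu) Pois(n+1; mu T) times the sum over l = 0..n of (R^l)_ai (R^(n-l))_ib.

```python
import math


def mat_mul(A, B):
    """Multiply two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def expected_time_in_state(Q, T, a, b, i, tol=1e-12):
    """E[time in state i | X(0) = a, X(T) = b] via uniformization.

    Illustrative sketch only. Assumes Q is a valid rate matrix, that b is
    reachable from a, and that mu * T is moderate (exp(-mu * T) must not
    underflow).
    """
    n = len(Q)
    mu = max(-Q[k][k] for k in range(n))  # uniformization rate
    # Transition matrix of the uniformized discrete-time chain: R = I + Q / mu.
    R = [[(1.0 if r == c else 0.0) + Q[r][c] / mu for c in range(n)]
         for r in range(n)]
    powers = [[[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]]
    pois = math.exp(-mu * T)         # Pois(0; mu * T)
    cum = pois                       # accumulated Poisson mass
    p_ab = pois * powers[0][a][b]    # running value of P_ab(T)
    integral = 0.0                   # running value of the integral
    m = 0
    while 1.0 - cum > tol:
        m += 1
        powers.append(mat_mul(powers[-1], R))  # powers[m] = R^m
        pois *= mu * T / m                     # Pois(m; mu * T)
        cum += pois
        p_ab += pois * powers[m][a][b]
        # Term n = m - 1 of the series: sum_{l=0}^{n} (R^l)_ai (R^(n-l))_ib.
        conv = sum(powers[l][a][i] * powers[m - 1 - l][i][b] for l in range(m))
        integral += pois / mu * conv
    return integral / p_ab


# Hypothetical two-state symmetric chain used only as a sanity check.
Q = [[-1.0, 1.0], [1.0, -1.0]]
t0 = expected_time_in_state(Q, 1.0, a=0, b=0, i=0)
t1 = expected_time_in_state(Q, 1.0, a=0, b=0, i=1)
print(t0, t1, t0 + t1)
```

For this two-state chain the conditional expected times in the two states must add up to T, and the time in state 0 has the closed form (T + tanh T)/2, which gives a simple check on the truncated series. Expected numbers of changes between two states follow the same pattern, with the indicator of state i replaced by the corresponding off-diagonal rate.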
