Stochastic errors vs. modeling errors in distance based phylogenetic reconstructions

Background

Distance-based phylogenetic reconstruction methods use evolutionary distances between species to reconstruct the phylogenetic tree spanning them. There are many different methods for estimating distances from sequence data. These methods assume different substitution models and have different statistical properties. Since the true substitution model is typically unknown, it is important to consider the effect of model misspecification on the performance of a distance estimation method.

Results

This paper continues a line of research that seeks to fit, to each given set of input sequences, a distance function that maximizes the expected topological accuracy of the reconstructed tree. We focus here on the systematic error caused by assuming an inadequate model, but also consider the stochastic error caused by using short sequences. We introduce a theoretical framework for analyzing both sources of error based on the notion of deviation from additivity, which quantifies the contribution of model misspecification to the estimation error. We demonstrate this framework by studying the behavior of the Jukes-Cantor distance function when applied to data generated according to Kimura's two-parameter model with a transition-transversion bias. We provide both a theoretical derivation for this case and a detailed simulation study on quartet trees.

Conclusions

We demonstrate both analytically and experimentally that by deliberately assuming an oversimplified evolutionary model, it is possible to increase the topological accuracy of reconstruction. Our theoretical framework provides new insights into the mechanisms that enable statistically inconsistent reconstruction methods to outperform consistent methods.
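To make the idea of systematic (modeling) error concrete, the following minimal Python sketch applies the standard Jukes-Cantor distance formula to the expected fraction of differing sites under Kimura's two-parameter model, that is, in the limit of infinitely long sequences where stochastic error vanishes. It is an illustration only and not the paper's definition of deviation from additivity: the function names, the parametrization by `kappa` (assumed here to be the ratio of the transition rate to the rate of each single transversion), and the `additivity_gap` measure are assumptions introduced for this example.

```python
import math


def k2p_expected_differences(d, kappa):
    """Expected transition (P) and transversion (Q) difference fractions
    between two sequences separated by d expected substitutions per site
    under K2P. Assumption: kappa = alpha / beta, where alpha is the
    transition rate and beta the rate of each of the two transversions."""
    beta_t = d / (kappa + 2.0)   # total rate is alpha + 2*beta
    alpha_t = kappa * beta_t
    P = 0.25 + 0.25 * math.exp(-4.0 * beta_t) - 0.5 * math.exp(-2.0 * (alpha_t + beta_t))
    Q = 0.5 - 0.5 * math.exp(-4.0 * beta_t)
    return P, Q


def jc_distance(p):
    """Jukes-Cantor distance from the total fraction p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def k2p_distance(P, Q):
    """Kimura two-parameter distance from transition/transversion fractions."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


def jc_on_k2p(d, kappa):
    """JC distance computed on noise-free K2P data at true distance d."""
    P, Q = k2p_expected_differences(d, kappa)
    return jc_distance(P + Q)


def additivity_gap(d1, d2, kappa):
    """Hypothetical illustration measure: how far the JC distance is from
    being additive along a path made of two edges of lengths d1 and d2."""
    return jc_on_k2p(d1, kappa) + jc_on_k2p(d2, kappa) - jc_on_k2p(d1 + d2, kappa)


if __name__ == "__main__":
    for kappa in (1.0, 2.0, 10.0):
        P, Q = k2p_expected_differences(0.7, kappa)
        print(f"kappa={kappa:5.1f}  K2P={k2p_distance(P, Q):.4f}  "
              f"JC={jc_on_k2p(0.7, kappa):.4f}  "
              f"gap(0.3, 0.4)={additivity_gap(0.3, 0.4, kappa):.4f}")
```

With `kappa = 1` the two models coincide and the gap is zero up to rounding. With a transition-transversion bias (`kappa > 1`) the Jukes-Cantor distance underestimates the true distance and is no longer additive along a path, while the Kimura distance recovers the true value. This sketch isolates the modeling error only; the stochastic error studied in the paper arises from finite sequence length and would require simulation.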
