A Kolmogorov-Smirnov test for the molecular clock based on Bayesian ensembles of phylogenies

Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests of molecular clocks. Here we propose two non-parametric tests of strict and relaxed molecular clocks, built upon a framework that uses the empirical cumulative distribution (ECD) of branch lengths obtained from an ensemble of Bayesian trees and the well-known non-parametric one-sample and two-sample Kolmogorov-Smirnov (KS) goodness-of-fit tests. In the strict clock case, the method uses the one-sample KS test to test directly whether the phylogeny is clock-like, that is, whether its branch lengths follow a Poisson law. The ECD is computed from the discretized branch lengths, and the parameter λ of the expected Poisson distribution is calculated as the average branch length over the ensemble of trees. To compensate for auto-correlation and pseudo-replication in the ensemble of trees, we take advantage of thinning and effective sample size, two features provided by Bayesian inference MCMC samplers. Finally, we observe that tree topologies with very long or very short branches lead to Poisson mixtures; for this case we propose the two-sample KS test with samples from two continuous branch length distributions, one obtained from an ensemble of clock-constrained trees and the other from an ensemble of unconstrained trees. In this second form the test can also be applied to relaxed clock models. Using a statistically equivalent ensemble of phylogenies to obtain the branch length ECD, instead of a single consensus tree, considerably reduces the effects of small sample size and increases the power of the test.
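The two tests described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how branch lengths are discretized, so the hypothetical helper `discretize` converts substitutions per site into rounded substitution counts using an assumed alignment length `n_sites`; the effective sample size is assumed to enter by replacing the nominal sample size in the asymptotic Kolmogorov p-value; and the function names are illustrative. The asymptotic p-value is known to be conservative for discrete distributions and does not account for estimating λ from the same data.

```python
import math
from collections import Counter


def discretize(branch_lengths, n_sites):
    """Assumed discretization: expected substitutions per branch, rounded."""
    return [round(b * n_sites) for b in branch_lengths]


def thin(sample, step):
    """Keep every `step`-th draw to reduce auto-correlation in the chain."""
    return sample[::step]


def kolmogorov_sf(x, terms=100):
    """Asymptotic Kolmogorov survival function Q(x)."""
    if x < 0.2:  # Q(x) is indistinguishable from 1 here; the series converges slowly
        return 1.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * x * x)
            for j in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * s))


def ks_poisson(counts, n_eff=None):
    """One-sample KS statistic of integer counts against Poisson(mean of counts).

    Returns (D, lambda, asymptotic p-value). `n_eff` is an assumed way of
    plugging in the effective sample size reported by the MCMC diagnostics.
    """
    n = len(counts)
    lam = sum(counts) / n
    freq = Counter(counts)
    term = math.exp(-lam)  # P(X = 0)
    cdf, ecdf, d = term, 0.0, 0.0
    for k in range(max(counts) + 1):
        if k > 0:
            term *= lam / k  # P(X = k) from P(X = k - 1)
            cdf += term
        ecdf += freq.get(k, 0) / n
        d = max(d, abs(ecdf - cdf))
    n_used = n if n_eff is None else n_eff
    return d, lam, kolmogorov_sf(math.sqrt(n_used) * d)


def ks_two_sample(a, b, n_eff_a=None, n_eff_b=None):
    """Two-sample KS statistic for continuous branch lengths.

    Returns (D, asymptotic p-value); e.g. `a` from clock-constrained trees
    and `b` from unconstrained trees.
    """
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    ma = na if n_eff_a is None else n_eff_a
    mb = nb if n_eff_b is None else n_eff_b
    return d, kolmogorov_sf(math.sqrt(ma * mb / (ma + mb)) * d)
```

In use, the branch lengths pooled from the thinned ensemble would be passed through `discretize` and then `ks_poisson` for the strict clock test, while `ks_two_sample` would compare the raw branch lengths of the constrained and unconstrained ensembles. A small p-value argues against the clock model in both cases.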