Hypothesis tests for phylogenetic quartets, with applications to coalescent-based species tree inference.

Numerous statistical methods have been developed to estimate evolutionary relationships among a collection of present-day species, typically represented by a phylogenetic tree, using the information contained in the DNA sequences sampled from representatives of each species. In the current era of high-throughput genome sequencing, the models underlying such methods have become increasingly sophisticated, and the resulting computations are often prohibitive. Here we consider the problem of rigorously testing the phylogenetic relationships among collections of four species under the multispecies coalescent model that accommodates both multi-locus datasets and SNP data. Our test employs two statistics: a new one, the summed absolute differences between certain columns in flattened phylogenetic matrices, and a previously used one that measures the distance of a flattened matrix from the space of rank-10 matrices. We derive distributional results for both statistics and study the performance of the corresponding hypothesis tests using both simulated and empirical data. We discuss how these tests may be used to improve inference of phylogenetic relationships for larger samples of species under the multispecies coalescent model, a problem that has until recently been computationally intractable.
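
As a rough illustration of the second statistic, the sketch below flattens four-taxon site patterns into a 16 x 16 frequency matrix for one split and computes its Frobenius distance to the nearest matrix of rank at most 10, which by the Eckart-Young theorem is the root sum of squares of the six smallest singular values. The function names, the row/column ordering, and the input format (one four-letter string per site) are assumptions made for this example, not details taken from the paper, and the sketch does not implement the hypothesis test or its null distribution.

```python
import itertools
import numpy as np

BASES = "ACGT"
# Map each ordered pair of nucleotides to a row/column index 0..15.
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(itertools.product(BASES, repeat=2))}


def flattening(sites, split=((0, 1), (2, 3))):
    """Return the 16 x 16 matrix of site-pattern frequencies for a split.

    `sites` is an iterable of 4-character strings over ACGT (hypothetical
    input format; gaps and ambiguity codes are not handled here).
    Rows index the states of the first taxon pair, columns the second.
    """
    (r1, r2), (c1, c2) = split
    flat = np.zeros((16, 16))
    for site in sites:
        row = PAIR_INDEX[site[r1] + site[r2]]
        col = PAIR_INDEX[site[c1] + site[c2]]
        flat[row, col] += 1
    total = flat.sum()
    return flat / total if total else flat


def rank_distance(matrix, rank=10):
    """Frobenius distance from `matrix` to the nearest matrix of rank <= `rank`."""
    sigma = np.linalg.svd(matrix, compute_uv=False)  # sorted in descending order
    return float(np.sqrt(np.sum(sigma[rank:] ** 2)))
```

Evaluating `rank_distance(flattening(sites, split))` for each of the three possible splits of four taxa, `((0, 1), (2, 3))`, `((0, 2), (1, 3))` and `((0, 3), (1, 2))`, gives one score per candidate quartet topology, with a smaller score indicating a flattening closer to rank 10.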