Impossibility of phylogeny reconstruction from k-mer counts

We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts of the leaf sequences alone. Formally, we establish that the joint leaf distributions of $k$-mer counts on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. That is, the two distributions cannot be distinguished with probability going to one in that asymptotic regime. Our results are information-theoretic: they imply an impossibility result for any reconstruction method using only $k$-mer counts at the leaves.
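To make the object of study concrete, the following is a minimal sketch of the $k$-mer count vector of a single two-state leaf sequence. The function name `kmer_counts` and the `"0"`/`"1"` encoding of the two states are assumptions made for illustration and do not come from the paper. The sketch only shows what information a $k$-mer-based method has access to; it is not part of the impossibility argument.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> dict[str, int]:
    """Count occurrences of every binary word of length k in seq.

    Overlapping windows are counted, and words that never occur
    are reported with count 0 so that the vector has 2**k entries.
    """
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_words = ("".join(w) for w in product("01", repeat=k))
    return {word: observed.get(word, 0) for word in all_words}


# Two different sequences that share the same 2-mer count vector:
# the counts discard where in the sequence each word occurred.
a = kmer_counts("0110", 2)
b = kmer_counts("1011", 2)
print(a)
print(a == b)
```

The two example sequences differ but produce identical count vectors, which gives a small-scale sense of the positional information that is lost when only $k$-mer counts are retained.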
