Statistically consistent and computationally efficient inference of ancestral DNA sequences in the TKF91 model under dense taxon sampling

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenetic analyses often rely on molecular sequences, such as DNA sequences, collected from the species of interest, and it is common in this context to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, insertions and deletions of nucleotides, also known as indels, are commonly omitted. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here, we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first explicit reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the “big bang” condition, a necessary and sufficient condition for statistical consistency in this setting.
