Efficient and consistent inference of ancestral sequences in an evolutionary model with insertions and deletions under dense taxon sampling

In evolutionary biology, the speciation history of living organisms is represented graphically by a phylogeny, that is, a rooted tree whose leaves correspond to current species and whose branchings indicate past speciation events. Phylogenies are commonly estimated from molecular sequences, such as DNA sequences, collected from the species of interest. At a high level, the idea behind this inference is simple: the further apart two species are in the Tree of Life, the greater the number of mutations that have accumulated in their genomes since their most recent common ancestor. To obtain accurate estimates in phylogenetic analyses, it is standard practice to employ statistical approaches based on stochastic models of sequence evolution on a tree. For tractability, such models necessarily make simplifying assumptions about the evolutionary mechanisms involved. In particular, insertions and deletions of nucleotides, also known as indels, are commonly omitted. Properly accounting for indels in statistical phylogenetic analyses remains a major challenge in computational evolutionary biology. Here we consider the problem of reconstructing ancestral sequences on a known phylogeny in a model of sequence evolution incorporating nucleotide substitutions, insertions and deletions, specifically the classical TKF91 process. We focus on the case of dense phylogenies of bounded height, which we refer to as the taxon-rich setting, where statistical consistency is achievable. We give the first polynomial-time ancestral reconstruction algorithm with provable guarantees under constant rates of mutation. Our algorithm succeeds when the phylogeny satisfies the "big bang" condition, a necessary and sufficient condition for statistical consistency in this context.
