Evolutionary inference via the Poisson Indel Process

We address the problem of joint statistical inference of phylogenetic trees and multiple sequence alignments from unaligned molecular sequences. This problem is generally formulated in terms of string-valued evolutionary processes along the branches of a phylogenetic tree. The classic evolutionary process, the TKF91 model [Thorne JL, Kishino H, Felsenstein J (1991) J Mol Evol 33(2):114–124], is a continuous-time Markov chain model composed of insertion, deletion, and substitution events. Unfortunately, this model gives rise to an intractable computational problem: computing the marginal likelihood under the TKF91 model takes time exponential in the number of taxa. In this work, we present a stochastic process, the Poisson Indel Process (PIP), under which the complexity of this computation is reduced to linear. The PIP is closely related to the TKF91 model, differing only in its treatment of insertions, but it has a global characterization as a Poisson process on the phylogeny. Standard results for Poisson processes allow key computations to be decoupled, which yields the favorable computational profile of inference under the PIP model. We present illustrative experiments in which Bayesian inference under the PIP model is compared with separate inference of phylogenies and alignments.
