Alignment-Free Phylogenetic Reconstruction: Sample Complexity via a Branching Process Analysis

We present an efficient phylogenetic reconstruction algorithm that allows insertions and deletions and provably achieves a sequence-length requirement (or sample complexity) that grows polynomially in the number of taxa. Our algorithm is distance-based: it relies only on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment.
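
To make "distance-based" and "alignment-free" concrete, the sketch below computes a pairwise dissimilarity between unaligned sequences of different lengths from their k-mer frequency profiles. This is a generic illustration and not the estimator analyzed in the paper; the function names, the choice of k = 2, and the toy sequences are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kmer_profile(seq, k=2):
    """Relative frequency of each length-k substring of seq (hypothetical helper)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(x, 0.0) - q.get(x, 0.0)) ** 2 for x in keys))


def distance_matrix(seqs, k=2):
    """Pairwise dissimilarities for a dict {taxon name: sequence}.

    No alignment is computed: each sequence is summarized on its own,
    and only the summaries are compared pair by pair.
    """
    profiles = {name: kmer_profile(s, k) for name, s in seqs.items()}
    return {
        (a, b): profile_distance(profiles[a], profiles[b])
        for a, b in combinations(sorted(seqs), 2)
    }


# Toy input (assumed data): B is A with one site deleted, C is unrelated.
toy = {"A": "ACGTACGTAC", "B": "ACGTACGAC", "C": "TTTTGGGGTTTT"}
D = distance_matrix(toy)
```

A matrix like `D` is the kind of input a distance-based tree-building method consumes. The sample-complexity guarantee stated in the abstract concerns the paper's own distance estimator, which this sketch does not reproduce.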
