Rec-I-DCM3: a fast algorithmic technique for reconstructing phylogenetic trees

Phylogenetic trees are commonly reconstructed by solving hard optimization problems such as maximum parsimony (MP) and maximum likelihood (ML). Conventional MP heuristics produce good solutions within reasonable time on small datasets (up to a few thousand sequences), while ML heuristics are limited to smaller datasets (up to a few hundred sequences). However, since MP (and presumably ML) is NP-hard, such approaches do not scale to large datasets. In this paper, we present a new technique called Recursive-Iterative-DCM3 (Rec-I-DCM3), which belongs to our family of disk-covering methods (DCMs). We tested this new technique on ten large biological datasets ranging from 1,322 to 13,921 sequences and obtained dramatic speedups as well as significant improvements in accuracy (better than 99.99%) in comparison to existing approaches. Thus, high-quality reconstructions can be obtained for datasets at least ten times larger than was previously possible.
