Performance of Supertree Methods on Various Data Set Decompositions

Many large-scale phylogenetic reconstruction methods attempt to solve hard optimization problems such as Maximum Parsimony (MP) and Maximum Likelihood (ML), but they are severely limited in the number of taxa they can handle in a reasonable timeframe. A standard heuristic approach to this problem is the divide-and-conquer strategy: decompose the data set into smaller subsets, solve each subset (i.e., use MP or ML on it to obtain a tree), and then combine the subset solutions into a solution for the original data set. This last step, combining given trees into a single tree, is known as supertree construction in computational phylogenetics. The traditional application of supertree methods is to combine existing, published phylogenies into a single phylogeny. Here, we study supertree construction in the context of divide-and-conquer methods for large-scale tree reconstruction. We study several divide-and-conquer approaches and demonstrate experimentally their advantage over the traditional supertree technique of Matrix Representation with Parsimony (MRP) and over global heuristics such as the parsimony ratchet. For the ten large biological data sets under investigation, our study shows that both the techniques used for dividing the data set into subproblems and those used for merging the subproblem solutions into a single solution strongly influence the quality of the supertree construction. In most cases, our merging technique, the Strict Consensus Merger, outperformed MRP with respect to MP scores and running time. Divide-and-conquer techniques are also a highly competitive alternative to global heuristics such as the parsimony ratchet, especially on the more challenging data sets.
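To make the MRP baseline concrete, the following is a minimal sketch of the standard matrix encoding behind MRP, not the implementation used in this study. It assumes rooted source trees written as nested Python tuples of taxon names; the function names `clades` and `mrp_matrix` are hypothetical. Each non-root clade of each source tree becomes one binary character: `1` for taxa inside the clade, `0` for taxa in that tree but outside the clade, and `?` for taxa missing from that tree. A parsimony search on the resulting matrix (not shown) would then yield the supertree.

```python
def clades(tree):
    """Return (leaf set, list of internal clades) for a nested-tuple rooted tree."""
    if isinstance(tree, str):          # a leaf is just a taxon name
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)               # the clade rooted at this internal node
    return leaves, found


def mrp_matrix(trees):
    """Encode source trees as a taxon -> character-string MRP matrix."""
    taxa = sorted(set().union(*(clades(t)[0] for t in trees)))
    rows = {taxon: [] for taxon in taxa}
    for tree in trees:
        leaves, found = clades(tree)
        for clade in found:
            if clade == leaves:        # the root clade carries no information
                continue
            for taxon in taxa:
                if taxon not in leaves:
                    rows[taxon].append("?")   # taxon absent from this source tree
                elif taxon in clade:
                    rows[taxon].append("1")
                else:
                    rows[taxon].append("0")
    return {taxon: "".join(chars) for taxon, chars in rows.items()}


if __name__ == "__main__":
    source_trees = [(("A", "B"), ("C", "D")), (("B", "C"), "E")]
    for taxon, chars in mrp_matrix(source_trees).items():
        print(taxon, chars)
```

In a divide-and-conquer setting, the source trees would be the subset trees, which overlap in some taxa; the `?` entries mark exactly the taxa that a given subset did not contain.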
