A collapsing method for the efficient recovery of optimal edges in phylogenetic trees

As sequencing efforts and the volume of genomic data continue to grow at an accelerating rate, phylogenetic analysis provides an evolutionary context for understanding and interpreting this growing body of complex data. We introduce a novel quartet-based method for inferring molecular phylogenies, called hypercleaning* (HC*). HC* builds on the hypercleaning (HC) technique, which has the notable property of recovering the edges of a phylogenetic tree that are best supported by the witness quartet set. HC* extends HC in two respects: (i) whereas HC requires the input quartet set to be unweighted (binary valued), HC* accepts any positive-valued quartet scores, so that more informative quartets can be defined; (ii) HC* employs a novel collapsing technique that significantly speeds up the inference stage, making it empirically on par with quartet puzzling in speed while still guaranteeing optimal edge recovery, as in HC. This paper is primarily aimed at presenting the algorithmic construction of HC*. We also report preliminary studies of an implementation of HC* as a potentially powerful approximation scheme for maximum-likelihood-based inference.
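
To make the notion of quartet support for an edge concrete, the following is a minimal sketch, not the HC* algorithm itself. It assumes that the support of an edge (a bipartition A|B of the taxa) is the sum of the scores of the input quartets ab|cd with a, b in A and c, d in B. The names `canonical` and `edge_support` and the example scores are hypothetical and chosen for illustration only; setting every score to 1 gives the unweighted (binary-valued) case.

```python
from itertools import combinations


def canonical(quartet):
    """Normalize a quartet ((a, b), (c, d)) so equal topologies compare equal."""
    left, right = quartet
    return frozenset((frozenset(left), frozenset(right)))


def edge_support(side_a, side_b, weights):
    """Sum the scores of quartets ab|cd with a, b in side_a and c, d in side_b.

    `weights` maps canonical quartets to positive scores (an assumed input
    format); quartets absent from it contribute 0. Returns the pair
    (total supporting score, number of quartets spanning the edge).
    """
    total = 0.0
    count = 0
    for pair_a in combinations(sorted(side_a), 2):
        for pair_b in combinations(sorted(side_b), 2):
            total += weights.get(canonical((pair_a, pair_b)), 0.0)
            count += 1
    return total, count


# Hypothetical weighted quartet set over the taxa a, b, c, d, e.
weights = {
    canonical((("a", "b"), ("c", "d"))): 0.9,
    canonical((("a", "b"), ("c", "e"))): 0.7,
    canonical((("a", "b"), ("d", "e"))): 0.8,
    canonical((("a", "c"), ("d", "e"))): 0.6,
    canonical((("b", "c"), ("d", "e"))): 0.5,
}

# Support for the two internal edges of the tree ((a,b),c,(d,e)).
support_ab, n_ab = edge_support({"a", "b"}, {"c", "d", "e"}, weights)
support_de, n_de = edge_support({"a", "b", "c"}, {"d", "e"}, weights)
```

In this sketch each edge is spanned by three quartets, and a higher total indicates an edge that the quartet set supports more strongly. How HC* actually ranks and recovers edges is defined in the paper, not here.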
