A Fast Quartet tree heuristic for hierarchical clustering

The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on n objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill-climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor on the order of a thousand to ten thousand. Both versions are implemented and available as part of the CompLearn package. We compare the performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized.
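To make the objective concrete, the following is a minimal sketch of a quartet-cost hill climber. It is not the CompLearn implementation. Several things are assumptions made for illustration: the cost of a quartet topology ab|cd is taken to be d(a,b) + d(c,d) for a given distance matrix d, the tree shape is fixed to a caterpillar (so only leaf swaps are tried, whereas a full heuristic would also change the shape), and all function names are hypothetical.

```python
from itertools import combinations
from math import comb
import random


def num_quartet_topologies(n):
    # Each of the C(n, 4) four-object subsets can be paired in 3 ways.
    return 3 * comb(n, 4)


def attach(p, n):
    # Index of the internal node that leaf position p hangs from in a
    # caterpillar tree with n leaves and n - 2 internal nodes in a chain.
    return min(max(p - 1, 0), n - 3)


def tree_dist(p, q, n):
    # Number of edges on the path between two distinct leaf positions.
    return abs(attach(p, n) - attach(q, n)) + 2


def embedded_pairing(quad, n):
    # The tree embeds the pairing whose two leaf-to-leaf paths are disjoint;
    # that pairing has the smallest sum of path lengths.
    a, b, c, d = quad
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(options,
               key=lambda o: tree_dist(*o[0], n) + tree_dist(*o[1], n))


def tree_cost(order, dist):
    # order[p] is the object placed at leaf position p.
    # Assumed quartet cost: d(a, b) + d(c, d) for the embedded pairing ab|cd.
    n = len(order)
    total = 0
    for quad in combinations(range(n), 4):
        (p, q), (r, s) = embedded_pairing(quad, n)
        total += dist[order[p]][order[q]] + dist[order[r]][order[s]]
    return total


def hill_climb(dist, steps=500, seed=0):
    # Randomized hill-climbing: propose a random leaf swap and keep it only
    # if the summed cost strictly decreases, so the cost never gets worse.
    rng = random.Random(seed)
    n = len(dist)
    order = list(range(n))
    rng.shuffle(order)
    best = tree_cost(order, dist)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        cost = tree_cost(order, dist)
        if cost < best:
            best = cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return order, best


# Toy input: two groups {0, 1, 2} and {3, 4, 5}, close within, far between.
toy = [[0 if i == j else (1 if i // 3 == j // 3 else 10) for j in range(6)]
       for i in range(6)]
final_order, final_cost = hill_climb(toy)
```

Because a swap is kept only when it lowers the cost, the sequence of accepted trees is monotone, which mirrors the "monotonic approximation" described above. As with any hill climber, this sketch can stop in a local optimum.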
