Reconstructing approximate phylogenetic trees from quartet samples

The reconstruction of evolutionary trees (also known as phylogenies) is central to many problems in biology. Accurate phylogenetic reconstruction methods are currently limited to at most a few dozen species. Constructing a tree over a larger set of species therefore requires a method that accurately infers trees over small, overlapping subsets and subsequently merges them into a tree over the complete set. A quartet tree, a tree over four taxa, is the smallest informative unit, and quartet-based methods combine quartet trees into one large tree. However, even this task is NP-hard, and it remains so even when the set of quartet trees is compatible (that is, all of them agree with a single tree). The general problem of approximating quartets, known as maximum quartet consistency (MQC), has been open for nearly twenty years, even for compatible inputs. Despite the problem's importance, the only rigorous approximation results are the naive 1/3 approximation, which applies to the general case, and a PTAS for the case where the input is the complete set of all $\binom{n}{4}$ possible quartets. Even when the correct quartet induced by every four taxa can be determined, the time needed to generate the complete set of quartets may be impractical. A faster approach is to sample just $m \ll \binom{n}{4}$ quartets at random and provide this sample as the input. In this work we present the first approximation algorithm whose guaranteed approximation ratio is strictly better than 1/3 when the input is any random sample of $m$ compatible quartets. The approximation ratio we obtain is 0.425 for general $m$, and 0.468 when $m = \omega(n^2)$. An important ingredient of our algorithm is solving a weighted Max-Cut problem in a certain graph induced by the set of input quartets. We also show an extension of the PTAS to handle dense, rather than complete, inputs.
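
To make the input model concrete, the following minimal Python sketch draws a random sample of $m$ compatible quartets from a known tree and measures the fraction of sampled quartets that a candidate tree satisfies, which is the MQC objective. It does not implement the approximation algorithm described above. The caterpillar toy tree, the unit edge lengths, and all function names are assumptions made purely for illustration.

```python
import random
from collections import deque


def caterpillar(leaves):
    """Adjacency map of an unrooted binary caterpillar tree (hypothetical toy input)."""
    n = len(leaves)
    assert n >= 4
    adj = {}

    def link(u, v):
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    for i in range(n - 3):  # spine of n - 2 internal nodes
        link(("int", i), ("int", i + 1))
    for k, leaf in enumerate(leaves):  # two leaves at each end, one elsewhere
        link(leaf, ("int", min(max(k - 1, 0), n - 3)))
    return adj


def leaf_distances(adj, leaves):
    """BFS from every leaf; dist[a][b] is the number of edges between a and b."""
    dist = {}
    for src in leaves:
        seen = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        dist[src] = seen
    return dist


def induced_quartet(dist, a, b, c, d):
    """Split that the tree induces on {a, b, c, d}, found via the four-point condition."""
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    # The true pairing minimises the sum of the two within-pair distances.
    p, q = min(pairings, key=lambda s: dist[s[0][0]][s[0][1]] + dist[s[1][0]][s[1][1]])
    return frozenset([frozenset(p), frozenset(q)])


def sample_quartets(dist, leaves, m, rng):
    """Draw m four-taxon sets at random (repeats allowed) with their induced splits."""
    sample = []
    for _ in range(m):
        four = tuple(rng.sample(leaves, 4))
        sample.append((four, induced_quartet(dist, *four)))
    return sample


def fraction_satisfied(candidate_dist, sample):
    """Fraction of sampled quartets that the candidate tree agrees with."""
    hits = sum(induced_quartet(candidate_dist, *four) == split for four, split in sample)
    return hits / len(sample)


if __name__ == "__main__":
    rng = random.Random(0)
    leaves = list(range(12))
    true_dist = leaf_distances(caterpillar(leaves), leaves)
    sample = sample_quartets(true_dist, leaves, 200, rng)

    shuffled = leaves[:]
    rng.shuffle(shuffled)
    guess_dist = leaf_distances(caterpillar(shuffled), leaves)

    print("true tree:", fraction_satisfied(true_dist, sample))
    print("randomly relabelled tree:", fraction_satisfied(guess_dist, sample))
```

Because the sample is drawn from a single tree, it is compatible by construction and that tree satisfies every sampled quartet. A tree with uniformly random leaf labels agrees with each quartet with probability 1/3, which corresponds to the naive baseline mentioned above.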
