Reflections on kernelizing and computing unrooted agreement forests

Phylogenetic trees are leaf-labelled trees used to model the evolution of species. Here we explore the practical impact of kernelization (i.e. data reduction) on the NP-hard problem of computing the TBR distance between two unrooted binary phylogenetic trees. This problem is better-known in the literature as the maximum agreement forest problem, where the goal is to partition the two trees into a minimum number of common, non-overlapping subtrees. We have implemented two well-known reduction rules, the subtree and chain reduction, and five more recent, theoretically stronger reduction rules, and compare the reduction achieved with and without the stronger rules. We find that the new rules yield smaller reduced instances and thus have clear practical added value. In many cases they also cause the TBR distance to decrease in a controlled fashion, which can further facilitate solving the problem in practice. Next, we compare the achieved reduction to the known worst-case theoretical bounds of $15k-9$ and $11k-9$ respectively, on the number of leaves of the two reduced trees, where $k$ is the TBR distance, observing in both cases a far larger reduction in practice. As a by-product of our experimental framework we obtain a number of new insights into the actual computation of TBR distance. We find, for example, that very strong lower bounds on TBR distance can be obtained efficiently by randomly sampling certain carefully constructed partitions of the leaf labels, and identify instances which seem particularly challenging to solve exactly. The reduction rules have been implemented within our new solver Tubro which combines kernelization with an Integer Linear Programming (ILP) approach. Tubro also incorporates a number of additional features, such as a cluster reduction and a practical upper-bounding heuristic, and it can leverage combinatorial insights emerging from the proofs of correctness of the reduction rules to simplify the ILP.

[1]  M. Steel,et al.  Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees , 2001 .

[2]  T. Turner Phylogenetics , 2018, The International Encyclopedia of Biological Anthropology.

[3]  Colin McDiarmid,et al.  Extremal Distances for Subtree Transfer Operations in Binary Trees , 2015, Annals of Combinatorics.

[4]  Michael R. Fellows,et al.  Fundamentals of Parameterized Complexity , 2013 .

[5]  Jianer Chen,et al.  Parameterized and approximation algorithms for maximum agreement forest in multifurcating trees , 2015, Theor. Comput. Sci..

[6]  N. Zeh,et al.  Supertrees Based on the Subtree Prune-and-Regraft Distance , 2014, Systematic biology.

[7]  R. Steele,et al.  Optimization , 2005, Encyclopedia of Biometrics.

[8]  Norbert Zeh,et al.  Fixed-Parameter Algorithms for Maximum Agreement Forests , 2011, SIAM J. Comput..

[9]  Chris Whidden,et al.  Calculating the Unrooted Subtree Prune-and-Regraft Distance , 2015, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Kenji Fukumizu,et al.  Multilocus phylogenetic analysis with gene tree clustering , 2015, Ann. Oper. Res..

[11]  Michael R. Fellows,et al.  What Is Known About Vertex Cover Kernelization? , 2018, Adventures Between Lower Bounds and Higher Altitudes.

[12]  Norbert Zeh,et al.  Computing Maximum Agreement Forests without Cluster Partitioning is Folly , 2017, ESA.

[13]  Jeremy M. Brown,et al.  Variation Across Mitochondrial Gene Trees Provides Evidence for Systematic Error: How Much Gene Tree Variation Is Biological? , 2018, Systematic biology.

[14]  Steven Kelk,et al.  New Reduction Rules for the Tree Bisection and Reconnection Distance , 2019, Annals of Combinatorics.

[15]  Steven Kelk,et al.  On the Complexity of Computing MP Distance Between Binary Phylogenetic Trees , 2014, ArXiv.

[16]  Tao Jiang,et al.  On the Complexity of Comparing Evolutionary Trees , 1996, Discret. Appl. Math..

[17]  E. Harding The probabilities of rooted tree-shapes generated by random bifurcation , 1971, Advances in Applied Probability.

[18]  Darren Strash,et al.  Engineering Kernelization for Maximum Cut , 2019, ALENEX.

[19]  Steven Kelk,et al.  Reduction rules for the maximum parsimony distance on phylogenetic trees , 2015, Theor. Comput. Sci..

[20]  Glenn Hickey,et al.  SPR Distance Computation for Unrooted Trees , 2008, Evolutionary bioinformatics online.

[21]  Magnus Bordewich,et al.  On the fixed parameter tractability of agreement-based phylogenetic distances , 2017, Journal of mathematical biology.

[22]  Daniel H. Huson,et al.  Phylogenetic Networks - Concepts, Algorithms and Applications , 2011 .

[23]  Rolf Niedermeier,et al.  Experiments on data reduction for optimal domination in networks , 2006, Ann. Oper. Res..

[24]  Monika Henzinger,et al.  Shared-Memory Branch-and-Reduce for Multiterminal Cuts , 2019, ALENEX.

[25]  Michal Pilipczuk,et al.  Parameterized Algorithms , 2015, Springer International Publishing.

[26]  Leo van Iersel,et al.  Kernelizations for the hybridization number problem on multiple nonbinary trees , 2013, J. Comput. Syst. Sci..

[27]  Rolf Niedermeier,et al.  The Power of Linear-Time Data Reduction for Maximum Matching , 2020, Algorithmica.

[28]  Steven Kelk,et al.  A note on convex characters, Fibonacci numbers and exponential-time algorithms , 2017, Adv. Appl. Math..

[29]  W. Fitch Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology , 1971 .

[30]  Fedor V. Fomin,et al.  Kernelization: Theory of Parameterized Preprocessing , 2019 .

[31]  Vincent Moulton,et al.  A parsimony-based metric for phylogenetic trees , 2015, Adv. Appl. Math..

[32]  Steven Kelk,et al.  On the Maximum Parsimony Distance Between Phylogenetic Trees , 2014, Annals of Combinatorics.

[33]  M. Kuhner,et al.  Practical performance of tree comparison metrics. , 2015, Systematic biology.

[34]  Vasek Chvátal,et al.  A Greedy Heuristic for the Set-Covering Problem , 1979, Math. Oper. Res..

[35]  Yufeng Wu,et al.  A practical method for exact computation of subtree prune and regraft distance , 2009, Bioinform..

[36]  Steven Kelk,et al.  A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees , 2018, SIAM J. Discret. Math..