High-Performance Algorithm Engineering for Computational Phylogenetics

A phylogeny is the evolutionary history of a group of organisms; systematists (and other biologists) attempt to reconstruct this history from various forms of data about contemporary organisms. Phylogeny reconstruction is a crucial step in the understanding of evolution as well as an important tool in biological, pharmaceutical, and medical research. Phylogeny reconstruction from molecular data is very difficult: almost all optimization models give rise to NP-hard (and thus computationally intractable) problems. Yet approximations must be of very high quality in order to avoid outright biological nonsense. Thus many biologists have been willing to run farms of processors for many months in order to analyze a single dataset. High-performance algorithm engineering offers a battery of tools that can reduce, sometimes spectacularly, the running time of existing phylogenetic algorithms, as well as help designers produce better algorithms. We present an overview of algorithm engineering techniques, illustrating them with an application to the “breakpoint analysis” method of Sankoff et al., which resulted in the GRAPPA software suite. GRAPPA demonstrated a speedup of more than eight orders of magnitude in running time over the original implementation on a variety of real and simulated datasets. We show how these algorithm engineering techniques are directly applicable to a large variety of challenging combinatorial problems in computational biology.
