The Multi-Processor Scheduling Problem in Phylogenetics

Advances in wet-lab sequencing techniques now make it possible to sequence anywhere from 100 genomes to 1000 full transcriptomes of species whose evolutionary relationships are to be disentangled by means of phylogenetic analyses. Likelihood-based evolutionary models allow such broad phylogenomic datasets to be partitioned, for instance into gene regions, so that the likelihood model parameters (except for the tree itself) can be estimated independently for each partition. Present-day phylogenomic datasets are typically split into 1000-10,000 distinct partitions. While the likelihood on such datasets needs to be computed in parallel because of the high memory requirements, it has not yet been assessed how to optimally distribute partitions and/or alignment sites to processors, in particular when the number of cores is significantly smaller than the number of partitions. We find that, when partitions (of varying lengths) are distributed monolithically to processors, the induced load distribution problem essentially corresponds to the well-known multiprocessor scheduling problem. By implementing the simple Longest Processing Time (LPT) heuristic in the PThreads and MPI versions of RAxML-Light, we were able to accelerate run times by up to one order of magnitude. Other heuristics for multiprocessor scheduling, such as improved MultiFit, improved Zero-One, or the Three Phase approach, did not yield notable performance improvements.
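
To illustrate the LPT heuristic named above, the following is a minimal Python sketch, not the RAxML-Light implementation. It assumes that the cost of a partition is proportional to its length (number of alignment sites), which is a simplification, and the names `lpt_schedule`, `partition_lengths`, and `num_processors` are hypothetical. Partitions are sorted by decreasing length, and each one is assigned in turn to the processor with the smallest current load.

```python
import heapq


def lpt_schedule(partition_lengths, num_processors):
    """Assign whole partitions to processors with the LPT rule.

    Assumption: the cost of a partition is proportional to its length.
    Returns (assignment, loads), where assignment[p] lists the indices
    of the partitions given to processor p.
    """
    if num_processors < 1:
        raise ValueError("need at least one processor")

    # Visit partitions from longest to shortest.
    order = sorted(
        range(len(partition_lengths)),
        key=lambda i: partition_lengths[i],
        reverse=True,
    )

    # Min-heap of (current load, processor id): the least loaded
    # processor is always at the top.
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)

    assignment = [[] for _ in range(num_processors)]
    for i in order:
        load, p = heapq.heappop(heap)
        assignment[p].append(i)
        heapq.heappush(heap, (load + partition_lengths[i], p))

    loads = [sum(partition_lengths[i] for i in part) for part in assignment]
    return assignment, loads


if __name__ == "__main__":
    # Made-up partition lengths, for illustration only.
    assignment, loads = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
    print(assignment, loads)
```

The largest entry of `loads` is the makespan, that is, the load of the busiest processor, which is the quantity the scheduling problem seeks to minimize. Using a heap keeps the cost of picking the least loaded processor logarithmic in the number of processors, so the sort dominates the running time.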
