Scheduling Independent Moldable Tasks on Multi-Cores with GPUs

We present a new approach for scheduling independent tasks on multiple CPUs and multiple GPUs. The tasks are assumed to be parallelizable on CPUs using the moldable model: the final number of cores allotted to a task can be decided and set by the scheduler. More precisely, we design an algorithm aiming at minimizing the makespan (the maximum completion time of all tasks) for this scheduling problem. The proposed algorithm combines a dual approximation scheme with a fast integer linear program (ILP). It determines both the partitioning of the tasks, i.e., whether a task should be mapped to CPUs or a GPU, and the number of CPUs allotted to a moldable task if mapped to the CPUs. A worst-case analysis shows that the algorithm has an approximation ratio of $\frac{3}{2} + \epsilon$. Since the time complexity of the ILP-based algorithm could be non-polynomial, we also present a polynomial-time algorithm with an approximation ratio of $2 + \epsilon$. We complement the theoretical analysis of our two novel algorithms with a simulation study. In these simulations, we compare our algorithms to a modified version of the classical HEFT algorithm, which we adapted to handle moldable tasks. The simulation results show that our algorithm with the $\left(\frac{3}{2} + \epsilon\right)$-approximation ratio produces significantly shorter schedules than the modified HEFT for most of the instances. In addition, our results provide evidence that our ILP-based algorithm can solve larger problem instances in a reasonable amount of time.
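
As an illustration of the dual approximation principle mentioned above, the following minimal Python sketch runs a binary search over a makespan guess. It is not the algorithm of this paper: the oracle is a deliberately simple stand-in (greedy list scheduling of sequential tasks on m identical machines, which is a 2-dual approximation), and the names `greedy_oracle` and `dual_approx_search` are hypothetical. The CPU/GPU partitioning, the moldable allotment, and the ILP step are all omitted.

```python
from typing import List, Optional, Tuple


def greedy_oracle(times: List[float], m: int, guess: float) -> Optional[List[float]]:
    """Toy dual-approximation step (assumption: sequential tasks, identical machines).

    Returns machine loads whose maximum is at most 2 * guess, or None when
    no schedule with makespan <= guess can exist.
    """
    if max(times) > guess or sum(times) > m * guess:
        return None  # certificate that the guess is below the optimum
    loads = [0.0] * m
    for t in times:
        i = loads.index(min(loads))  # place the task on the least-loaded machine
        loads[i] += t
    return loads


def dual_approx_search(times: List[float], m: int,
                       eps: float = 1e-3) -> Tuple[float, List[float]]:
    """Binary search on the guess; assumes a non-empty list of positive times."""
    lo = 0.0          # guesses at or below lo are known to be rejected
    hi = sum(times)   # always accepted: one machine could run everything
    best = greedy_oracle(times, m, hi)
    while hi - lo > eps * hi:
        mid = (lo + hi) / 2
        loads = greedy_oracle(times, m, mid)
        if loads is None:
            lo = mid
        else:
            hi, best = mid, loads
    return hi, best


if __name__ == "__main__":
    guess, loads = dual_approx_search([3, 3, 2, 2, 2], m=2)
    print(guess, loads)
```

The search narrows the gap between a rejected guess and an accepted one, so the returned schedule is at most twice the final accepted guess, which is itself close to a lower bound on the optimum. Replacing the oracle with a stronger one (for instance, one with a guarantee of 3/2) tightens the overall ratio in the same way.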
