Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures

In this paper, we present a comparison of scheduling strategies for heterogeneous multi-CPU and multi-GPU architectures. We designed and evaluated four scheduling strategies on top of the XKaapi runtime: work stealing, data-aware work stealing, locality-aware work stealing, and Heterogeneous Earliest-Finish-Time (HEFT). On a heterogeneous architecture with 12 CPUs and 8 GPUs, we analysed these strategies with four benchmarks: a BLAS-1 AXPY vector operation, a Jacobi 2D iterative computation, and two linear algebra algorithms, Cholesky and LU. We conclude that work stealing can be efficient if task annotations are provided along with a data locality strategy. Furthermore, our experimental results suggest that HEFT scheduling performs better on applications with very regular computations and low data locality.
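To make the HEFT baseline concrete, the following is a minimal, hypothetical sketch of the classic HEFT idea (Topcuoglu et al., reference [8]'s "Performance-Effective and Low-Complexity Task Scheduling"), not the paper's actual implementation: tasks are prioritized by upward rank, then each task is greedily mapped to the processor (CPU or GPU) that yields the earliest finish time. All names here are illustrative, and communication costs are omitted for brevity.

```python
def heft(tasks, succ, cost, n_procs):
    """Hypothetical minimal HEFT sketch.
    tasks: list of task ids; succ: dict task -> list of successor ids;
    cost: dict (task, proc) -> execution time; n_procs: processor count.
    Returns dict task -> (proc, start, finish)."""
    # Upward rank: average execution cost plus the largest successor rank.
    rank = {}
    def upward(t):
        if t in rank:
            return rank[t]
        avg = sum(cost[(t, p)] for p in range(n_procs)) / n_procs
        rank[t] = avg + max((upward(s) for s in succ.get(t, [])), default=0.0)
        return rank[t]
    for t in tasks:
        upward(t)

    # Build predecessor lists so we know when a task's inputs are ready.
    pred = {t: [] for t in tasks}
    for t, ss in succ.items():
        for s in ss:
            pred[s].append(t)

    proc_free = [0.0] * n_procs  # time at which each processor becomes free
    finish = {}                  # task -> finish time
    sched = {}
    # Schedule tasks in decreasing upward-rank order.
    for t in sorted(tasks, key=lambda t: -rank[t]):
        ready = max((finish[p] for p in pred[t]), default=0.0)
        # Pick the processor giving the earliest finish time for this task.
        best = min(range(n_procs),
                   key=lambda p: max(ready, proc_free[p]) + cost[(t, p)])
        start = max(ready, proc_free[best])
        end = start + cost[(t, best)]
        proc_free[best] = end
        finish[t] = end
        sched[t] = (best, start, end)
    return sched
```

In a CPU+GPU setting, the per-processor cost table is where the heterogeneity enters: a GPU-friendly kernel has a much lower cost entry for GPU slots, so the earliest-finish-time rule naturally steers it there. This greedy, cost-model-driven mapping is why HEFT favours regular computations, where per-task costs are predictable.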
