Data-Oriented Runtime Scheduling Framework on Multi-GPUs

The GPU is generally accepted as an efficient accelerator in high performance computing (HPC). On some heterogeneous systems, multiple GPUs are installed on each computing node, and these GPUs may even have different architectures. Scheduling tasks and data efficiently on such systems is therefore a challenge. In this paper, we present DoSFoG, a data-oriented runtime scheduling framework for heterogeneous systems equipped with multiple GPUs. In DoSFoG, data blocks, instead of tasks, are taken as the scheduling units. DoSFoG represents an application as a data-oriented directed acyclic graph (DoDAG), which is proven to be equivalent to the task DAG. A runtime scheduling framework is designed on top of the DoDAG. In addition, a hierarchical storage structure is designed around the various levels of memory in the system. Page-locked memory and a soft cache in GPU device memory are used to improve data transfer. DoSFoG is evaluated with different applications on a system equipped with different GPUs. The results show that DoSFoG achieves high data locality, scalability, and load balance, and improves performance for large data sizes.
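The idea of scheduling data blocks rather than tasks can be illustrated with a minimal sketch. It is not the DoDAG construction from the paper, whose exact definition is not given here. It assumes one plausible reading: each task writes exactly one output block, and a block depends on every block read by the task that produces it. The task names, block identifiers, and the functions `build_block_graph` and `block_order` are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical task list: (task name, input block ids, output block id).
# Assumption: every block is written by at most one task.
TASKS = [
    ("potrf", ["A00"], "L00"),
    ("trsm", ["L00", "A10"], "L10"),
    ("syrk", ["L10", "A11"], "S11"),
]


def build_block_graph(tasks):
    """Return (edges, producer); edges[b] holds the blocks computed from b."""
    edges = defaultdict(set)
    producer = {}
    for name, inputs, output in tasks:
        producer[output] = name
        for block in inputs:
            edges[block].add(output)
    return edges, producer


def block_order(tasks):
    """Order blocks with Kahn's algorithm: a block is ready once all the
    blocks it is computed from are available."""
    edges, producer = build_block_graph(tasks)
    blocks = set(producer) | set(edges)
    indegree = {b: 0 for b in blocks}
    for src in edges:
        for dst in edges[src]:
            indegree[dst] += 1
    # Blocks with no producer are initial inputs and are ready immediately.
    ready = deque(sorted(b for b in blocks if indegree[b] == 0))
    order = []
    while ready:
        block = ready.popleft()
        order.append(block)
        for dst in sorted(edges[block]):
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    if len(order) != len(blocks):
        raise ValueError("block graph contains a cycle")
    return order
```

Calling `block_order(TASKS)` yields an order in which every block appears after the blocks it is computed from. A real runtime would additionally choose a device for each ready block and account for where its input blocks currently reside, which this sketch leaves out.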
