Computation and Communication Aware Task Graph Scheduling on Multi-GPU Systems

GPUs have emerged as popular throughput computing platforms due to their massively parallel computing capability and low cost. To push performance beyond a single GPU, there is growing interest in systems with multiple GPUs. Attaining high performance on a multi-GPU system involves three main design challenges: load balance, memory utilization, and data transfer. Imbalanced load across the system leaves GPUs idle, while poor data reuse triggers excessive memory accesses. In addition, inefficient data transfer between host and device becomes a considerable overhead in high-throughput computing. This paper addresses these design issues by proposing Computation and Communication Aware task graph Scheduling (CCAS) for multi-GPU systems. CCAS adopts an effective heuristic algorithm that considers both data reuse and load balance across GPUs, and it hides data transfer overhead by extensively overlapping computation with data communication. Experimental results show that CCAS achieves an average performance improvement of 22.15% over a previous approach.
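The overlap of computation and data communication mentioned above is typically realized with asynchronous CUDA streams. The following is a minimal sketch of that general technique, not the paper's CCAS scheduler: two GPUs each process a sequence of data chunks, and alternating between two streams per device lets the host-to-device copy of one chunk proceed while the kernel for the previous chunk runs. The kernel body, chunk size, and device count are illustrative assumptions.

```cuda
// Minimal sketch (not the paper's CCAS code) of hiding host-to-device
// transfer latency behind kernel execution on a multi-GPU system.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // placeholder work
}

int main() {
    const int numGpus = 2, chunks = 4, chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *hostBuf;
    cudaMallocHost(&hostBuf, numGpus * chunks * chunkBytes);
    for (int i = 0; i < numGpus * chunks * chunkElems; ++i) hostBuf[i] = 1.0f;

    float *devBuf[2];
    cudaStream_t stream[2][2];            // two streams per GPU for double buffering
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&devBuf[g], chunks * chunkBytes);
        cudaStreamCreate(&stream[g][0]);
        cudaStreamCreate(&stream[g][1]);
    }

    // Alternating streams lets the copy for chunk c+1 overlap with the
    // kernel for chunk c on each GPU, while both GPUs run concurrently.
    for (int c = 0; c < chunks; ++c) {
        for (int g = 0; g < numGpus; ++g) {
            cudaSetDevice(g);
            cudaStream_t s = stream[g][c & 1];
            float *dst = devBuf[g] + c * chunkElems;
            const float *src = hostBuf + (g * chunks + c) * chunkElems;
            cudaMemcpyAsync(dst, src, chunkBytes, cudaMemcpyHostToDevice, s);
            process<<<(chunkElems + 255) / 256, 256, 0, s>>>(dst, chunkElems);
        }
    }

    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaStreamDestroy(stream[g][0]);
        cudaStreamDestroy(stream[g][1]);
        cudaFree(devBuf[g]);
    }
    cudaFreeHost(hostBuf);
    printf("done\n");
    return 0;
}
```

The essential ingredients are pinned host buffers and multiple streams per device; without them, cudaMemcpyAsync serializes with kernel execution and the transfer overhead is fully exposed rather than hidden.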
