Computation and Communication Aware Task Graph Scheduling on Multi-GPU Systems

GPUs have emerged as popular throughput computing platforms due to their massively parallel computing capability and low cost. To push performance beyond a single GPU, there is growing interest in systems with multiple GPUs. Attaining high performance on a multi-GPU system involves three main design challenges: load balance, memory utilization, and data transfer. Imbalanced load across the system leaves GPUs idle, while poor data reuse triggers excessive memory accesses. In addition, inefficient data transfer between host and device becomes a considerable overhead in high-throughput computing. This paper addresses these design issues by proposing Computation and Communication Aware task graph Scheduling (CCAS) for multi-GPU systems. CCAS adopts an effective heuristic algorithm that considers both data reuse and load balance across GPUs, and it hides data transfer overhead by extensively overlapping computation with data communication. Experimental results show that CCAS achieves an average performance improvement of 22.15% over a previous approach.
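The overlap of computation and data communication mentioned above is typically realized with asynchronous CUDA streams. The following is a minimal sketch of that general technique, not the paper's CCAS scheduler: two GPUs each process a sequence of data chunks, and alternating between two streams per device lets the host-to-device copy of one chunk proceed while the kernel for the previous chunk runs. The kernel body, chunk size, and device count are illustrative assumptions.

```cuda
// Minimal sketch (not the paper's CCAS code) of hiding host-to-device
// transfer latency behind kernel execution on a multi-GPU system.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // placeholder work
}

int main() {
    const int numGpus = 2, chunks = 4, chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *hostBuf;
    cudaMallocHost(&hostBuf, numGpus * chunks * chunkBytes);
    for (int i = 0; i < numGpus * chunks * chunkElems; ++i) hostBuf[i] = 1.0f;

    float *devBuf[2];
    cudaStream_t stream[2][2];            // two streams per GPU for double buffering
    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&devBuf[g], chunks * chunkBytes);
        cudaStreamCreate(&stream[g][0]);
        cudaStreamCreate(&stream[g][1]);
    }

    // Alternating streams lets the copy for chunk c+1 overlap with the
    // kernel for chunk c on each GPU, while both GPUs run concurrently.
    for (int c = 0; c < chunks; ++c) {
        for (int g = 0; g < numGpus; ++g) {
            cudaSetDevice(g);
            cudaStream_t s = stream[g][c & 1];
            float *dst = devBuf[g] + c * chunkElems;
            const float *src = hostBuf + (g * chunks + c) * chunkElems;
            cudaMemcpyAsync(dst, src, chunkBytes, cudaMemcpyHostToDevice, s);
            process<<<(chunkElems + 255) / 256, 256, 0, s>>>(dst, chunkElems);
        }
    }

    for (int g = 0; g < numGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
        cudaStreamDestroy(stream[g][0]);
        cudaStreamDestroy(stream[g][1]);
        cudaFree(devBuf[g]);
    }
    cudaFreeHost(hostBuf);
    printf("done\n");
    return 0;
}
```

The essential ingredients are pinned host buffers and multiple streams per device; without them, cudaMemcpyAsync serializes with kernel execution and the transfer overhead is fully exposed rather than hidden.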
