A Static Task Scheduling Framework for Independent Tasks Accelerated Using a Shared Graphics Processing Unit

The High Performance Computing (HPC) field is witnessing the increasing use of Graphics Processing Units (GPUs) as application accelerators, due to their massively data-parallel computing architectures and exceptional floating-point computational capabilities. The performance advantage from GPU-based acceleration is primarily derived for GPU computational kernels that operate on large amount of data, consuming all of the available GPU resources. For applications that consist of several independent computational tasks that do not occupy the entire GPU, sequentially using the GPU one task at a time leads to performance inefficiencies. It is therefore important for the programmer to cluster small tasks together for sharing the GPU, however, the best performance cannot be achieved through an ad-hoc grouping and execution of these tasks. In this paper, we explore the problem of GPU tasks scheduling, to allow multiple tasks to efficiently share and be executed in parallel on the GPU. We analyze factors affecting multi-tasking parallelism and performance, followed by developing the multi-tasking execution model as a performance prediction approach. The model is validated by comparing with actual execution scenarios for GPU sharing. We then present the scheduling technique and algorithm based on the proposed model, followed by experimental verifications of the proposed approach using an NVIDIA Fermi GPU computing node. Our results demonstrate significant performance improvements using the proposed scheduling approach, compared with sequential execution of the tasks under the conventional multi-tasking execution scenario.

[1]  Kevin Skadron,et al.  Enabling Task Parallelism in the CUDA Scheduler , 2009 .

[2]  David S. Johnson,et al.  Fast Algorithms for Bin Packing , 1974, J. Comput. Syst. Sci..

[3]  Proshanta Saha Automatic Software Hardware Co-Design for Reconfigurable Computing Systems , 2007, 2007 International Conference on Field Programmable Logic and Applications.

[4]  Cédric Augonnet,et al.  StarPU: a unified platform for task scheduling on heterogeneous multicore architectures , 2011, Concurr. Comput. Pract. Exp..

[5]  Jürgen Teich,et al.  Maintaining Virtual Areas on FPGAs using Strip Packing with Delays , 2010, ArXiv.

[6]  Edward G. Coffman,et al.  Approximation algorithms for bin packing: a survey , 1996 .

[7]  F. Black,et al.  The Pricing of Options and Corporate Liabilities , 1973, Journal of Political Economy.

[8]  Hossam ElGindy,et al.  Dynamic scheduling of tasks on partially reconfigurable FPGAs , 2000 .

[9]  Vikram K. Narayana,et al.  Reconfiguration and Communication-Aware Task Scheduling for High-Performance Reconfigurable Computing , 2010, TRETS.

[10]  Long Chen,et al.  Dynamic load balancing on single- and multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[11]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[12]  Michael F. P. O'Boyle,et al.  A Static Task Partitioning Approach for Heterogeneous Systems Using OpenCL , 2011, CC.

[13]  Rahul Mangharam,et al.  Anytime Algorithms for GPU Architectures , 2011, 2011 IEEE 32nd Real-Time Systems Symposium.