Multi-threaded Kernel Offloading to GPGPU Using Hyper-Q on Kepler Architecture

Small-scale computations usually cannot fully utilize the compute capabilities of modern GPGPUs. With the Fermi GPU architecture, Nvidia introduced the concurrent kernel execution feature, which allows up to 16 GPU kernels to execute simultaneously on a shared GPU device for better utilization of its resources. Insufficient scheduling capabilities, however, can cause the achieved concurrency to fall significantly short of this theoretical level. With the Kepler GPU architecture, Nvidia addresses this issue by introducing the Hyper-Q feature, which provides 32 hardware-managed work queues for concurrent kernel execution. We investigate the Hyper-Q feature within heterogeneous workloads in which multiple concurrent host threads or processes each offload computations to the GPU. Using a synthetic benchmark kernel and a hybrid parallel CPU-GPU real-world application, we evaluate the performance obtained with Hyper-Q on the GPU and compare it against a kernel reordering mechanism previously introduced by the authors for the Fermi architecture.