CPU-assisted GPU thread pool model for dynamic task parallelism

As GPUs grow more powerful, exploiting their computing performance becomes an urgent yet challenging problem, especially for applications with fine-grained parallelism. Task programming handles fine-grained parallelism efficiently, but current GPU task-parallel solutions based on either concurrent kernel execution (CKE) or persistent kernels suffer from high CPU-GPU interaction costs. The page-locked host memory supported by recent GPU generations effectively turns CPU-GPU heterogeneous systems into a non-uniform memory access (NUMA) architecture, making it possible to improve CPU-GPU interaction through shared-memory programming. In this paper, we propose the CPU-assisted GPU thread pool (CAGTP) model, which combines data parallelism and task parallelism at the thread block level to support applications with fine-grained parallelism. Within the CAGTP model, we design the Computing Block Level task Scheduling (CBLS) method, in which task slots allocated in page-locked host memory eliminate competition among thread blocks. A dedicated host-side scheduler dispatches tasks to thread blocks, and the per-task scheduling overhead (about 200 ns) is much lower than that of comparable systems. Experimental results show that the CAGTP model efficiently supports fine-grained task parallelism with or without dependencies, and that it outperforms CKE for batched GEMMs, Cholesky factorization, and mixed workloads.
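To make the mechanism concrete, the sketch below shows one way such a scheme can be wired up in CUDA: a persistent kernel in which each thread block polls a private task slot allocated in mapped page-locked host memory, while the host process plays the role of the scheduler. This is a minimal illustration under stated assumptions (a 64-bit Linux host and a UVA-capable GPU), not the paper's CBLS implementation; all identifiers (TaskSlot, worker_pool, the EMPTY/READY/DONE/QUIT states) and the toy doubling task are hypothetical.

// Minimal sketch: per-block task slots in mapped page-locked host
// memory, polled by a persistent kernel. Illustrative only; the
// struct, states, and "task" are assumptions, not the paper's API.
#include <cstdio>
#include <cuda_runtime.h>

enum SlotState { EMPTY = 0, READY = 1, DONE = 2, QUIT = 3 };

struct TaskSlot {
    volatile int state;   // polled by both the host scheduler and the GPU
    volatile int arg;     // hypothetical task payload
    int result;
};

// Persistent kernel: each block polls its own private slot, so blocks
// never compete for a shared queue.
__global__ void worker_pool(TaskSlot *slots) {
    TaskSlot *slot = &slots[blockIdx.x];
    for (;;) {
        if (threadIdx.x == 0)
            while (slot->state == EMPTY || slot->state == DONE) { /* spin */ }
        __syncthreads();
        if (slot->state == QUIT) return;
        // Placeholder work item; a real task would occupy the whole block.
        if (threadIdx.x == 0) {
            slot->result = slot->arg * 2;
            __threadfence_system();   // publish result before flipping state
            slot->state = DONE;
        }
        __syncthreads();
    }
}

int main() {
    const int kBlocks = 4;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    TaskSlot *slots, *d_slots;
    // Mapped page-locked allocation: visible to both CPU and GPU.
    cudaHostAlloc(&slots, kBlocks * sizeof(TaskSlot), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_slots, slots, 0);
    for (int i = 0; i < kBlocks; ++i) slots[i].state = EMPTY;

    worker_pool<<<kBlocks, 128>>>(d_slots);

    // Host-side scheduler stand-in: hand one task to block 0, wait for it.
    slots[0].arg = 21;
    __sync_synchronize();             // order payload write before state flip
    slots[0].state = READY;
    while (slots[0].state != DONE) { /* spin */ }
    printf("block 0 computed %d\n", slots[0].result);

    for (int i = 0; i < kBlocks; ++i) slots[i].state = QUIT;
    cudaDeviceSynchronize();
    cudaFreeHost(slots);
    return 0;
}

Because each block owns exactly one slot, the fast path needs no atomics or inter-block queues: the host writes a slot only when it is EMPTY or DONE, and the GPU writes it only when it is READY, which is the sense in which slot-per-block scheduling eliminates competition among thread blocks.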
