Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total cost of ownership. Despite recently improved runtime support for concurrent GPU kernel executions, the GPU can still be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system to improve the throughput of concurrent kernel executions on the GPU. Kernelet embraces transparent memory management and PCI-e data transfer techniques, as well as dynamic slicing and scheduling techniques for kernel executions. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely, slices). Each slice has tunable occupancy, allowing it to be co-scheduled with slices of other kernels for high GPU utilization. We develop a novel Markov chain-based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31 percent and 23 percent performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.
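To make the slicing idea concrete, the following CUDA sketch shows one common way a kernel can be split into independently launchable slices of thread blocks by adding an explicit block offset. The kernel name vecAddSlice, the blockOffset parameter, the launchSliced helper, and the fixed slice size are illustrative assumptions for this sketch, not Kernelet's actual implementation.

```cuda
// Minimal sketch of source-level kernel slicing (illustrative, not Kernelet's
// actual mechanism): a kernel is launched as a sequence of small sub-grids
// ("slices"), each identified by a block offset into the full grid.
#include <algorithm>
#include <cuda_runtime.h>

// Kernel rewritten to take an explicit block offset, so that any contiguous
// subset of its thread blocks can be launched on its own.
__global__ void vecAddSlice(const float *a, const float *b, float *c,
                            int n, int blockOffset) {
    int logicalBlock = blockIdx.x + blockOffset;      // block id in the full grid
    int i = logicalBlock * blockDim.x + threadIdx.x;  // global element index
    if (i < n) c[i] = a[i] + b[i];
}

// Issue one kernel as a sequence of slices on a given stream. Each slice uses
// only sliceBlocks thread blocks, leaving SM resources free so that slices of
// another kernel, issued on a different stream, can run concurrently.
void launchSliced(const float *a, const float *b, float *c, int n,
                  int threadsPerBlock, int sliceBlocks, cudaStream_t stream) {
    int totalBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    for (int offset = 0; offset < totalBlocks; offset += sliceBlocks) {
        int blocks = std::min(sliceBlocks, totalBlocks - offset);
        vecAddSlice<<<blocks, threadsPerBlock, 0, stream>>>(a, b, c, n, offset);
    }
}
```

Two kernels would then be co-scheduled by calling launchSliced for each on a separate CUDA stream; the per-kernel slice size (and hence occupancy) is the tunable knob that a scheduler, such as Kernelet's Markov chain-based model, would select to maximize GPU utilization.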
