EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU

Modern GPUs are broadly adopted in many multitasking environments, including data centers and smartphones. However, current support for scheduling multiple GPU kernels (from different applications) is limited, forming a major barrier for GPUs to meet many practical needs. This work for the first time demonstrates that on existing GPUs, efficient preemptive scheduling of GPU kernels is possible even without special hardware support. Specifically, it presents EffiSha, a pure software framework that enables preemptive scheduling of GPU kernels with very low overhead. The enabled preemptive scheduler offers flexible support for kernels of different priorities, and demonstrates significant potential for reducing the average turnaround time and improving the overall system throughput of programs that time-share a modern GPU.
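To make the idea of software-only preemption concrete, the following CUDA sketch illustrates the general persistent-threads style of voluntary, block-boundary yielding that software preemption schemes build on. It is only an assumed, minimal illustration, not EffiSha's actual code or API: the kernel name, the `evictFlag`/`nextTask` variables, and the block-sized task granularity are all hypothetical. A launched set of thread blocks repeatedly claims "task blocks" of work and, before each one, checks a host-set eviction flag; when the flag is raised, the blocks stop claiming work and exit, allowing another kernel to run, and the kernel can later be relaunched to resume from the recorded progress.

```cuda
// Hypothetical sketch of block-level voluntary preemption (not EffiSha's code).
#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int nextTask;    // index of the next unclaimed task block
__device__ volatile int evictFlag;   // set by the host/scheduler to request preemption

__global__ void vecAddPreemptible(const float *a, const float *b, float *c,
                                  int n, int tasksTotal, int taskSize) {
    __shared__ unsigned int myTask;
    while (true) {
        if (threadIdx.x == 0) {
            // Yield point: once eviction is requested, stop claiming new task blocks.
            myTask = evictFlag ? tasksTotal : atomicAdd(&nextTask, 1u);
        }
        __syncthreads();
        if (myTask >= tasksTotal) return;   // all work done, or kernel evicted
        int i = myTask * taskSize + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
        __syncthreads();                    // all threads finish this task block first
    }
}

int main() {
    const int n = 1 << 20, taskSize = 256;
    const int tasksTotal = (n + taskSize - 1) / taskSize;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    unsigned int zero = 0; int off = 0;
    cudaMemcpyToSymbol(nextTask, &zero, sizeof(zero));
    cudaMemcpyToSymbol(evictFlag, &off, sizeof(off));

    // Launch far fewer blocks than task blocks; each block loops over tasks.
    vecAddPreemptible<<<64, taskSize>>>(a, b, c, n, tasksTotal, taskSize);

    // A scheduler daemon could set evictFlag here to preempt the kernel and
    // later relaunch it; nextTask preserves where the computation left off.
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Because preemption happens only at task-block boundaries, almost no per-thread context needs to be saved, which is why such a software approach can keep the preemption overhead low on unmodified hardware.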
