Simultaneous Multikernel: Fine-Grained Sharing of GPUs

Studies show that non-graphics programs are often poorly optimized for GPU hardware, leading to significant resource under-utilization. Sharing the GPU among multiple programs can effectively improve utilization, which is particularly attractive in systems (e.g., cloud computing) where many applications require GPU access. However, current GPUs lack the architectural support for such sharing. Existing attempts are preliminary: they either provide only static sharing, which requires recompilation or code transformation, or fail to effectively improve GPU resource utilization. We propose Simultaneous Multikernel (SMK), a fine-grained dynamic sharing mechanism that fully utilizes the resources within a streaming multiprocessor by exploiting the heterogeneity of different kernels. We extend the GPU hardware to support SMK and propose several resource allocation strategies to improve system throughput while maintaining fairness. Our evaluation of 45 shared workloads shows that SMK improves GPU throughput by 34 percent over non-shared execution and by 10 percent over a state-of-the-art design.
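The core idea of fine-grained sharing can be illustrated with a small sketch: co-schedule thread blocks from two kernels onto one streaming multiprocessor so that a compute-heavy kernel and a memory-heavy kernel fill complementary resources, while a fairness rule keeps either kernel from monopolizing the SM. This is a hypothetical illustration in the spirit of SMK, not the paper's actual algorithm; the capacity numbers, per-block demands, and the dominant-share fairness heuristic are all assumptions made for the example.

```python
# Hypothetical sketch of fine-grained SM sharing in the spirit of SMK:
# greedily admit thread blocks from two kernels onto one SM, preferring
# the kernel whose accumulated "dominant share" of SM resources is
# smaller. Capacities and block demands are illustrative only.

SM_CAPACITY = {"threads": 2048, "registers": 65536, "shared_mem": 49152}

def fits(usage, block):
    # A block fits if no resource would exceed the SM's capacity.
    return all(usage[r] + block[r] <= SM_CAPACITY[r] for r in SM_CAPACITY)

def dominant_share(block):
    # A block's dominant share: its largest fractional demand on any resource.
    return max(block[r] / SM_CAPACITY[r] for r in SM_CAPACITY)

def co_schedule(kernel_a, kernel_b):
    """Pack blocks from two kernels onto one SM. At each step, try the
    kernel holding the smaller total dominant share first, so neither
    kernel monopolizes the SM."""
    usage = {r: 0 for r in SM_CAPACITY}
    shares = {"a": 0.0, "b": 0.0}
    placed = {"a": 0, "b": 0}
    blocks = {"a": kernel_a, "b": kernel_b}
    while True:
        for k in sorted(blocks, key=lambda k: shares[k]):
            if fits(usage, blocks[k]):
                for r in SM_CAPACITY:
                    usage[r] += blocks[k][r]
                shares[k] += dominant_share(blocks[k])
                placed[k] += 1
                break
        else:
            return placed  # no block of either kernel fits anymore

# Example: a register-heavy kernel and a shared-memory-heavy kernel
# occupy complementary resources, so they pack well together.
compute_kernel = {"threads": 256, "registers": 16384, "shared_mem": 0}
memory_kernel = {"threads": 128, "registers": 4096, "shared_mem": 12288}
print(co_schedule(compute_kernel, memory_kernel))  # → {'a': 3, 'b': 4}
```

Run alone, either kernel exhausts one resource (registers or shared memory) while leaving the others idle; mixing the two lets the SM host seven blocks total, which is the heterogeneity-driven packing effect the abstract describes.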
