Quality of service support for fine-grained sharing on GPUs

GPUs have been widely adopted in data centers to provide acceleration services to many applications. Sharing a GPU is increasingly important for better processing throughput and energy efficiency. However, quality of service (QoS) among concurrent applications is minimally supported. Previous efforts are too coarse-grained and not scalable with increasing QoS requirements. We propose QoS mechanisms for a fine-grained form of GPU sharing. Our QoS support can provide control over the progress of kernels on a per cycle basis and the amount of thread-level parallelism of each kernel. Due to accurate resource management, our QoS support has significantly better scalability compared with previous best efforts. Evaluations show that, when the GPU is shared by three kernels, two of which have QoS goals, the proposed techniques achieve QoS goals 43.8% more often than previous techniques and have 20.5% higher throughput.

[1]  Rami G. Melhem,et al.  SAWS: Synchronization aware GPGPU warp scheduling for multiple independent warp schedulers , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Won Woo Ro,et al.  Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[3]  Rami G. Melhem,et al.  Simultaneous Multikernel: Fine-Grained Sharing of GPUs , 2016, IEEE Computer Architecture Letters.

[4]  Quan Chen,et al.  DjiNN and Tonic: DNN as a service and its implications for future warehouse scale computers , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[5]  Nam Sung Kim,et al.  The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[6]  Mikhail Bautin,et al.  Graphic engine resource management , 2008, Electronic Imaging.

[7]  R. Govindarajan,et al.  Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.

[8]  Kevin Skadron,et al.  Fine-grained resource sharing for concurrent GPGPU kernels , 2012, HotPar'12.

[9]  Jianlong Zhong,et al.  Medusa: Simplified Graph Processing on GPUs , 2014, IEEE Transactions on Parallel and Distributed Systems.

[10]  Harrick M. Vin,et al.  A hierarchial CPU scheduler for multimedia operating systems , 1996, OSDI '96.

[11]  John Kim,et al.  Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[12]  Michael L. Scott,et al.  Enabling OS Research by Inferring Interactions in the Black-Box GPU Stack , 2013, USENIX Annual Technical Conference.

[13]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[14]  R. Govindarajan,et al.  Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[15]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  David R. Cheriton,et al.  Borrowed-virtual-time (BVT) scheduling: supporting latency-sensitive threads in a general-purpose scheduler , 1999, OPSR.

[17]  Xiangyu Li,et al.  Mystic: Predictive Scheduling for GPU Based Cloud Servers Using Machine Learning , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[18]  Scott A. Mahlke,et al.  Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[19]  Nam Sung Kim,et al.  Fair share: Allocation of GPU resources for both performance and fairness , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).

[20]  Rami G. Melhem,et al.  Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[21]  Mahmut T. Kandemir,et al.  Exploiting Core Criticality for Enhanced GPU Performance , 2016, SIGMETRICS.

[22]  Harrick M. Vin,et al.  A hierarchial CPU scheduler for multimedia operating systems , 1996, OSDI '96.

[23]  Mateo Valero,et al.  Enabling preemptive multiprogramming on GPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[24]  Mahmut T. Kandemir,et al.  Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[25]  Wei Yi,et al.  Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU , 2010, 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing.

[26]  Prashant J. Shenoy,et al.  Surplus fair scheduling: a proportional-share CPU scheduling algorithm for symmetric multiprocessors , 2000, OSDI.

[27]  Scott Shenker,et al.  Analysis and simulation of a fair queueing algorithm , 1989, SIGCOMM '89.

[28]  Jiaxing Zhang,et al.  Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning , 2014 .

[29]  Yin Wang,et al.  VGRIS: Virtualized GPU Resource Isolation and Scheduling in Cloud Gaming , 2013, TACO.

[30]  Quan Chen,et al.  Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers , 2016, ASPLOS.

[31]  Michael L. Scott,et al.  Disengaged scheduling for fair, protected access to fast computational accelerators , 2014, ASPLOS.

[32]  Peter Kulchyski and , 2015 .

[33]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[34]  Klara Nahrstedt,et al.  Energy-efficient soft real-time CPU scheduling for mobile multimedia systems , 2003, SOSP '03.

[35]  Shinpei Kato,et al.  TimeGraph: GPU Scheduling for Real-Time Multi-Tasking Environments , 2011, USENIX Annual Technical Conference.

[36]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[37]  Michael F. P. O'Boyle,et al.  Portable and transparent software managed scheduling on accelerators for fair resource sharing , 2016, 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[38]  Nam Sung Kim,et al.  QoS-aware dynamic resource allocation for spatial-multitasking GPUs , 2014, 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC).

[39]  George Varghese,et al.  Efficient fair queueing using deficit round robin , 1995, SIGCOMM '95.

[40]  Hussein M. Abdel-Wahab,et al.  A proportional share resource allocation algorithm for real-time, time-shared systems , 1996, 17th IEEE Real-Time Systems Symposium.

[41]  Scott A. Mahlke,et al.  Chimera: Collaborative Preemption for Multitasking on a Shared GPU , 2015, ASPLOS.

[42]  Mark Silberstein,et al.  PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[43]  Dong Li,et al.  Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations , 2015, ICS.

[44]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .