GPUShare: Fair-Sharing Middleware for GPU Clouds

Many new cloud-focused applications such as deep learning and graph analytics have started to rely on the high computing throughput of GPUs, but cloud providers cannot currently support fine-grained time-sharing on GPUs to enable multi-tenancy for these types of applications. Currently, scheduling is performed by the GPU driver in combination with a hardware thread dispatcher to maximize utilization. However, when multiple applications with contrasting kernel running times and high-utilization of the GPU need to be co-located, this approach unduly favors one or more of the applications at the expense of others. This paper presents GPUShare, a middleware solution for GPU fair sharing among high-utilization, long-running applications. It begins by analyzing the scenarios under which the current driver-based multi-process scheduling fails, noting that such scenarios are quite common. It then describes a software-based mechanism that can yield a kernel before all of its threads have run, thus giving finer control over the time slice for which the GPU is allocated to a process. In controlling time slices on the GPU by yielding kernels, GPUShare improves fair GPU sharing across tenants and outperforms the CUDA driver by up to 45% for two tenants and by up to 89% for more than two tenants, while incurring a maximum overhead of only 12%. Additional improvements are obtained from having a central scheduler that further smooths out disparities across tenants' GPU shares improving fair sharing by up to 92% for two tenants and by up to 76% for more than two tenants.

[1]  Kevin Skadron,et al.  Load balancing in a changing world: dealing with heterogeneity and performance variability , 2013, CF '13.

[2]  Karsten Schwan,et al.  Scheduling Multi-tenant Cloud Workloads on Accelerator-Based Systems , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Mateo Valero,et al.  Enabling preemptive multiprogramming on GPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[4]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[5]  John D. Owens,et al.  Gunrock: a high-performance graph processing library on the GPU , 2015, PPoPP.

[6]  Giulio Giunta,et al.  A GPGPU Transparent Virtualization Component for High Performance Computing Clouds , 2010, Euro-Par.

[7]  James W. Layland,et al.  Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment , 1989, JACM.

[8]  Michael L. Scott,et al.  Disengaged scheduling for fair, protected access to fast computational accelerators , 2014, ASPLOS.

[9]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[10]  Yaozu Dong,et al.  A Full GPU Virtualization Solution with Mediated Pass-Through , 2014, USENIX Annual Technical Conference.

[11]  Mark Silberstein,et al.  PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[12]  Jure Leskovec,et al.  {SNAP Datasets}: {Stanford} Large Network Dataset Collection , 2014 .

[13]  Shinpei Kato,et al.  Gdev: First-Class GPU Resource Management in the Operating System , 2012, USENIX Annual Technical Conference.

[14]  Wen-mei W. Hwu,et al.  MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs , 2008, LCPC.

[15]  Vanish Talwar,et al.  GViM: GPU-accelerated virtual machines , 2009, HPCVirt '09.

[16]  Shinpei Kato,et al.  GPUvm: Why Not Virtualizing GPUs at the Hypervisor? , 2014, USENIX Annual Technical Conference.

[17]  R. Govindarajan,et al.  Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.

[18]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[19]  Vivek Sarkar,et al.  Languages and Compilers for Parallel Computing , 1994, Lecture Notes in Computer Science.

[20]  Jean-Philippe Martin,et al.  Dandelion: a compiler and runtime for heterogeneous systems , 2013, SOSP.

[21]  Scott A. Mahlke,et al.  Chimera: Collaborative Preemption for Multitasking on a Shared GPU , 2015, ASPLOS.

[22]  Srimat T. Chakradhar,et al.  Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework , 2011, HPDC '11.

[23]  Vanish Talwar,et al.  Pegasus: Coordinated Scheduling for Virtualized Accelerator-based Systems , 2011, USENIX ATC.

[24]  Lin Shi,et al.  vCUDA: GPU accelerated high performance computing in virtual machines , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[25]  Scott A. Mahlke,et al.  Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[26]  Laurent Massoulié,et al.  Bandwidth sharing: objectives and algorithms , 2002, TNET.

[27]  Johannes Stallkamp,et al.  Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition , 2012, Neural Networks.

[28]  Federico Silla,et al.  rCUDA: Reducing the number of GPU-based accelerators in high performance clusters , 2010, 2010 International Conference on High Performance Computing & Simulation.

[29]  Nam Sung Kim,et al.  The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.