Paper Information - GPUShare: Fair-Sharing Middleware for GPU Clouds

GPUShare: Fair-Sharing Middleware for GPU Clouds

Many new cloud-focused applications, such as deep learning and graph analytics, have started to rely on the high computing throughput of GPUs, but cloud providers cannot currently support fine-grained time-sharing on GPUs to enable multi-tenancy for these types of applications. Currently, scheduling is performed by the GPU driver in combination with a hardware thread dispatcher to maximize utilization. However, when multiple applications with contrasting kernel running times and high GPU utilization need to be co-located, this approach unduly favors one or more of the applications at the expense of the others. This paper presents GPUShare, a middleware solution for fair GPU sharing among high-utilization, long-running applications. It begins by analyzing the scenarios under which the current driver-based multi-process scheduling fails, noting that such scenarios are quite common. It then describes a software-based mechanism that can yield a kernel before all of its threads have run, thus giving finer control over the time slice for which the GPU is allocated to a process. By controlling time slices on the GPU through kernel yielding, GPUShare improves fair GPU sharing across tenants and outperforms the CUDA driver by up to 45% for two tenants and by up to 89% for more than two tenants, while incurring a maximum overhead of only 12%. Additional improvements are obtained from a central scheduler that further smooths out disparities across tenants' GPU shares, improving fair sharing by up to 92% for two tenants and by up to 76% for more than two tenants.
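The effect of yielding kernels on fairness can be illustrated with a toy simulation. The sketch below is not GPUShare's algorithm or the CUDA driver's actual policy. It assumes a simplified round-robin model in which each tenant relaunches a kernel of fixed duration forever, and the function name `gpu_shares` and its parameters are hypothetical. Without yielding, a tenant with long kernels holds the GPU for a whole kernel on each turn; with a time slice, a kernel gives up the GPU early and resumes on its next turn.

```python
def gpu_shares(kernel_ms, horizon_ms, slice_ms=None):
    """Return each tenant's fraction of GPU time under toy round-robin scheduling.

    kernel_ms:  running time of each tenant's kernel (relaunched forever).
    horizon_ms: total simulated time.
    slice_ms:   None means a kernel always runs to completion (no yielding);
                otherwise a kernel yields after slice_ms and resumes later.
    All of this is an illustrative assumption, not the paper's model.
    """
    used = [0.0] * len(kernel_ms)
    remaining = list(kernel_ms)  # work left in each tenant's current kernel
    clock, turn = 0.0, 0
    while clock < horizon_ms:
        t = turn % len(kernel_ms)
        # Run the whole kernel, or at most one time slice if yielding is on.
        run = remaining[t] if slice_ms is None else min(remaining[t], slice_ms)
        run = min(run, horizon_ms - clock)  # do not run past the horizon
        used[t] += run
        clock += run
        remaining[t] -= run
        if remaining[t] <= 0:
            remaining[t] = kernel_ms[t]  # tenant launches its next kernel
        turn += 1
    return [u / clock for u in used]


# Tenant 0 runs 10 ms kernels, tenant 1 runs 1 ms kernels.
print(gpu_shares([10, 1], 1100))              # roughly 10/11 vs 1/11
print(gpu_shares([10, 1], 1100, slice_ms=1))  # [0.5, 0.5]
```

In this toy model the long-kernel tenant receives about ten times the GPU time of the short-kernel tenant when kernels run to completion, while a 1 ms slice equalizes the shares. A real system also pays a cost for saving and resuming kernel state, which the sketch ignores.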
