Paper Information - Priority-Based PCIe Scheduling for Multi-Tenant Multi-GPU Systems

Multi-GPU systems are widely used in data centers to accelerate compute-intensive workloads such as deep neural network training. However, the limited PCIe bandwidth between the CPU and multiple GPUs becomes a major performance bottleneck. We observe that a traditional round-robin PCIe scheduling policy can cause severe bandwidth contention and stall execution on multiple GPUs. In this article, we propose a priority-based scheduling policy that aims to overlap data transfers with GPU execution across different applications to alleviate this contention. We also propose a dynamic priority policy for semi-QoS management that helps applications meet their QoS requirements and improves overall multi-GPU system throughput. Experimental results show that our priority-based PCIe scheduling scheme improves system throughput by 7.6 percent on average compared with a round-robin PCIe scheduler. Semi-QoS management helps meet the defined QoS goals while preserving application throughput.
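
The intuition behind overlapping transfers with execution can be illustrated with a toy simulation. The sketch below is not the paper's scheduler or its evaluation setup: the `App` class, the `simulate` function, the one-chunk-per-tick link model, and the fixed priority order (lower index wins) are all assumptions made for illustration. Each application repeatedly transfers data over one shared link and then computes on its own GPU; the two policies differ only in which waiting application gets the link in a given tick.

```python
from dataclasses import dataclass


@dataclass
class App:
    transfer: int       # link ticks needed per iteration (assumed >= 1)
    compute: int        # GPU ticks needed per iteration (assumed >= 1)
    iters: int          # iterations still to run
    xfer_left: int = 0  # link ticks left in the current transfer
    comp_left: int = 0  # GPU ticks left in the current compute phase
    finish: int = -1    # tick at which the app completed (-1 = running)


def simulate(specs, policy):
    """Return per-app finish times under "rr" or "priority" link arbitration.

    specs is a list of (transfer, compute, iters) tuples. The shared link
    moves one chunk per tick for exactly one app; every app owns a GPU,
    so compute phases of different apps run in parallel.
    """
    apps = [App(t, c, n, xfer_left=t) for t, c, n in specs]
    tick, rr_next = 0, 0
    while any(a.finish < 0 for a in apps):
        live = [i for i, a in enumerate(apps) if a.finish < 0]
        waiting = [i for i in live if apps[i].xfer_left > 0]
        computing = [i for i in live if apps[i].xfer_left == 0]

        if waiting:
            if policy == "priority":
                # Hypothetical static priority: lowest index goes first.
                chosen = min(waiting)
            else:
                # Round-robin: next waiting app in cyclic order.
                chosen = min(waiting, key=lambda i: (i - rr_next) % len(apps))
                rr_next = (chosen + 1) % len(apps)
            app = apps[chosen]
            app.xfer_left -= 1
            if app.xfer_left == 0:
                app.comp_left = app.compute  # compute starts next tick

        for i in computing:
            app = apps[i]
            app.comp_left -= 1
            if app.comp_left == 0:
                app.iters -= 1
                if app.iters == 0:
                    app.finish = tick + 1
                else:
                    app.xfer_left = app.transfer  # queue the next transfer
        tick += 1
    return [a.finish for a in apps]


if __name__ == "__main__":
    specs = [(4, 4, 2), (4, 4, 2)]  # two identical apps, two iterations each
    print("round-robin:", simulate(specs, "rr"))
    print("priority:   ", simulate(specs, "priority"))
```

In this toy configuration, round-robin interleaves the two transfers so both applications reach their compute phase late and the link then sits idle, while the priority order lets one application compute while the other transfers. The resulting finish times are artifacts of the simplified model and say nothing about the 7.6 percent figure reported in the abstract.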
