A memory-driven scheduling scheme and optimization for concurrent execution in GPU

Concurrent execution of GPU tasks is supported on modern GPU devices. However, limited device memory is a clear bottleneck when many GPU tasks run together, and task priority and overall system performance are often ignored. To address these issues, this paper proposes a real-time GPU scheduling scheme. A reservation algorithm based on device memory (RBDM) is adopted to give high-priority tasks more opportunity to execute. High-priority-first wake (HPFW) and small-memory HPFW (SM-HPFW) policies are employed when scheduling waiting tasks to improve priority response time and system performance. A CPU-based monitor is developed to track GPU task execution. Experiments show that RBDM works effectively. Compared with FIFO, HPFW significantly decreases overall priority response time, and SM-HPFW reduces overall task completion time by 20% when the device-memory requirements of GPU tasks are evenly distributed.
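The sketch below illustrates, under stated assumptions, how the ideas in the abstract could fit together: memory-based admission that holds back a reserved share of device memory for high-priority tasks (the RBDM idea), and HPFW / SM-HPFW policies for waking waiting tasks when memory is released. It is not the authors' implementation; the names Task, GpuMemoryScheduler, reserve_ratio, and high_priority_threshold are introduced here purely for illustration.

# Minimal sketch of RBDM-style admission plus HPFW / SM-HPFW wake policies.
# All names and thresholds are assumptions, not taken from the paper.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    mem_mb: int          # device memory the task needs
    priority: int        # larger value = higher priority

@dataclass
class GpuMemoryScheduler:
    total_mem_mb: int
    reserve_ratio: float = 0.2          # share of memory held back for high-priority tasks (RBDM idea)
    policy: str = "HPFW"                # "HPFW" or "SM-HPFW"
    free_mem_mb: int = field(init=False)
    waiting: List[Task] = field(default_factory=list)

    def __post_init__(self):
        self.free_mem_mb = self.total_mem_mb

    def submit(self, task: Task, high_priority_threshold: int = 5) -> bool:
        """Admit the task if enough memory is free; otherwise enqueue it."""
        reserved = int(self.total_mem_mb * self.reserve_ratio)
        # Low-priority tasks may not dip into the reserved portion.
        usable = self.free_mem_mb if task.priority >= high_priority_threshold \
                 else self.free_mem_mb - reserved
        if task.mem_mb <= usable:
            self.free_mem_mb -= task.mem_mb
            return True                  # task launched
        self.waiting.append(task)
        return False                     # task must wait

    def release(self, task: Task) -> List[Task]:
        """Called by the CPU-side monitor when a task finishes; wakes waiting tasks."""
        self.free_mem_mb += task.mem_mb
        if self.policy == "HPFW":        # high-priority-first wake
            self.waiting.sort(key=lambda t: -t.priority)
        else:                            # SM-HPFW: among equal priorities, wake small-memory tasks first
            self.waiting.sort(key=lambda t: (-t.priority, t.mem_mb))
        woken = []
        for t in list(self.waiting):
            if t.mem_mb <= self.free_mem_mb:
                self.free_mem_mb -= t.mem_mb
                self.waiting.remove(t)
                woken.append(t)
        return woken

In this sketch the monitor's role reduces to calling release() on task completion; a real scheduler would also need to query actual device memory (e.g., via the CUDA runtime) rather than track it purely in host-side bookkeeping.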
