Dissecting the CUDA Scheduling Hierarchy: A Performance and Predictability Perspective

Over the last few years, the ever-increasing use of Graphics Processing Units (GPUs) in safety-related domains has opened up many research problems in the real-time community. The closed and proprietary nature of the scheduling mechanisms deployed in NVIDIA GPUs, for instance, is a major obstacle to deriving a proper schedulability analysis for latency-sensitive applications. Existing literature addresses these issues either by (i) providing simplified models of heterogeneous CPU-GPU systems and their associated scheduling policies, or by (ii) providing insights into these arbitration mechanisms obtained through reverse engineering. In this paper, we go one step further by correcting and consolidating previously published assumptions about the hierarchical scheduling policies of NVIDIA GPUs and their proprietary CUDA application programming interface. We also discuss how these mechanisms have evolved with recently released GPU micro-architectures, and how such changes affect the scheduling models available to real-time system engineers.
