Analyzing and Estimating the Performance of Concurrent Kernels Execution on GPUs

GPUs have established a new baseline for power efficiency and computing power, delivering greater bandwidth and more computing units with each new generation. Modern GPUs support concurrent kernel execution to maximize resource utilization, allowing kernels to exploit resources that would otherwise sit idle. However, the decision to run different kernels simultaneously is made by the hardware, and in some cases the GPU will not schedule blocks from another kernel even when the required resources are available. In this work, we present an in-depth study of the simultaneous execution of kernels on the GPU. We establish the conditions necessary for kernels to execute concurrently, identify the factors that influence contention between them, and describe a model that estimates the resulting performance degradation. Finally, we validate the model using synthetic and real-world kernels with different computation and memory requirements.
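The mechanism that makes concurrent kernel execution possible in practice is launching kernels into separate CUDA streams; the hardware then decides whether blocks from both kernels can be co-scheduled given the registers, shared memory, and threads each block requires. The sketch below is not the paper's model, just a minimal, hedged illustration of that mechanism: the kernel names (`compute_bound`, `memory_bound`), sizes, and workloads are illustrative placeholders chosen to mimic kernels with different computation and memory requirements.

```cuda
// Minimal sketch: two independent kernels launched in separate CUDA streams.
// Whether they actually overlap is decided by the hardware, based on the
// per-block resource demands of each kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute_bound(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = out[i];
        for (int k = 0; k < 1024; ++k)      // arithmetic-heavy loop
            v = v * 1.000001f + 0.5f;
        out[i] = v;
    }
}

__global__ void memory_bound(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;       // bandwidth-dominated scale
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    dim3 block(256), grid((n + 255) / 256);
    // Different streams: the scheduler may place blocks from both kernels
    // on the SMs at the same time if enough resources remain free.
    compute_bound<<<grid, block, 0, s1>>>(a, n);
    memory_bound <<<grid, block, 0, s2>>>(b, c, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    printf("done\n");
    return 0;
}
```

Pairing a compute-bound and a memory-bound kernel, as above, is the kind of mix where co-execution tends to pay off, since the two kernels stress different resources; profiling the same launches serialized in a single stream gives a baseline against which any slowdown from sharing can be measured.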
