Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs

GPUs have gained tremendous popularity in a broad range of application domains. These applications possess varying grains of parallelism and place high demands on compute resources, often imposing real-time constraints, requiring flexible work schedules, and relying on concurrent execution of multiple kernels on the device. These requirements present a number of challenges when targeting current GPUs. To support this class of applications, and to take full advantage of the large number of compute cores present on the GPU, we need a new mechanism to support concurrent execution and provide flexible mapping of compute kernels to the GPU. In this paper, we describe a new scheduling mechanism for dynamic spatial partitioning of the GPU, which adapts to the current execution state of compute workloads on the device. To enable this functionality, we extend the OpenCL runtime environment to map multiple command queues to a single device, effectively partitioning the device. As a result, kernels that benefit from concurrent execution on a partitioned device can utilize the full compute resources of the GPU. To accelerate next-generation workloads, we also support an inter-kernel communication mechanism that enables concurrent kernels to interact in a producer-consumer relationship. The proposed partitioning mechanism is evaluated using real-world applications taken from the signal and image processing, linear algebra, and data mining domains. For these performance-hungry applications, we achieve a 3.1X speedup using a combination of the proposed scheduling scheme and inter-kernel communication, compared with the conventional GPU runtime.
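
To make the idea of adaptive spatial partitioning concrete, the following is a minimal sketch of how compute units (CUs) might be redistributed among command queues as the workload changes. The abstract does not specify the partitioning policy, so the proportional-share rule, the function name `partition_compute_units`, and the queue names are all assumptions made for illustration; this is not the authors' scheduler and it does not call the OpenCL API.

```python
def partition_compute_units(total_cus, pending_workgroups):
    """Split total_cus among command queues that have pending work.

    Hypothetical policy (an assumption, not taken from the paper):
    every active queue gets one CU, and the rest are shared in
    proportion to each queue's number of pending work-groups.
    """
    active = {q: n for q, n in pending_workgroups.items() if n > 0}
    if not active:
        return {}
    if len(active) > total_cus:
        raise ValueError("more active queues than compute units")

    # One CU per active queue so that no kernel starves.
    shares = {q: 1 for q in active}
    spare = total_cus - len(active)
    total_work = sum(active.values())

    # Share the remaining CUs in proportion to pending work (rounded down).
    for q, n in active.items():
        shares[q] += spare * n // total_work

    # CUs lost to rounding go to the queues with the most pending work.
    leftover = total_cus - sum(shares.values())
    for q in sorted(active, key=active.get, reverse=True)[:leftover]:
        shares[q] += 1
    return shares


# Two kernels run concurrently on a 32-CU device.
print(partition_compute_units(32, {"fft": 300, "blur": 100}))
# The "blur" kernel finishes, so the partition adapts and "fft" gets the whole device.
print(partition_compute_units(32, {"fft": 300, "blur": 0}))
```

Calling the function again whenever a kernel is enqueued or completes mimics the adaptive behavior described above: the partition follows the current execution state instead of being fixed when the queues are created.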
