Execution of compound multi‐kernel OpenCL computations in multi‐CPU/multi‐GPU environments

Current computational systems are heterogeneous by nature, featuring a combination of CPUs and graphics processing units (GPUs). As the latter become an established platform for high-performance computing, the focus is shifting towards the seamless programming of these hybrid systems as a whole. The distinct architectural and execution models in place raise several challenges, as the best hardware configuration is behavior- and workload-dependent. In this paper, we address the execution of compound, multi-kernel Open Computing Language (OpenCL) computations in multi-CPU/multi-GPU environments. We address how these computations may be efficiently scheduled onto the target hardware, and how the system may adapt itself to changes in the workload to be processed and to fluctuations in CPU load. An experimental evaluation attests to the performance gains obtained by the conjoined use of CPU and GPU devices, when compared with GPU-only executions, and also by the use of data-locality optimizations in CPU environments.
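As a rough illustration only, and not the framework described in the paper, the C host-code sketch below shows the kind of device enumeration and work partitioning that conjoined CPU/GPU execution builds on: it looks up one CPU and one GPU device through the standard OpenCL API and splits a one-dimensional index space between them using a hypothetical, hard-coded ratio (cpu_share). An adaptive scheduler of the sort the paper discusses would instead derive such a ratio from the observed workload and from CPU load.

/*
 * Minimal sketch (assumptions: a single CPU and a single GPU device are
 * enough, and a fixed 30/70 split is used in place of an adaptive policy).
 * Compile with an OpenCL SDK, e.g.: cc sketch.c -lOpenCL
 */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);
    if (num_platforms > 8) num_platforms = 8;
    cl_platform_id platforms[8];
    clGetPlatformIDs(num_platforms, platforms, NULL);

    /* Find one CPU and one GPU device across the available platforms. */
    cl_device_id cpu = NULL, gpu = NULL, d;
    for (cl_uint p = 0; p < num_platforms; ++p) {
        if (!cpu && clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, 1, &d, NULL) == CL_SUCCESS)
            cpu = d;
        if (!gpu && clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 1, &d, NULL) == CL_SUCCESS)
            gpu = d;
    }

    /* Hypothetical static partition of the NDRange: 30% to the CPU, 70% to
       the GPU. This ratio is an assumption for illustration, not a value
       taken from the paper. */
    const size_t global_size = 1 << 20;
    const double cpu_share = 0.3;
    size_t cpu_items = (size_t)(global_size * cpu_share);
    size_t gpu_items = global_size - cpu_items;

    printf("CPU device %s: %zu work-items\n", cpu ? "found" : "missing", cpu_items);
    printf("GPU device %s: %zu work-items\n", gpu ? "found" : "missing", gpu_items);

    /* A complete host program would now create one context and command queue
       per device, enqueue the same kernel twice with the partitioned global
       sizes and matching global offsets, and merge the partial results. */
    return 0;
}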
