Simplifying programming and load balancing of data parallel applications on heterogeneous systems

Heterogeneous architectures have developed rapidly thanks to their excellent cost/performance ratio and low power consumption. However, heterogeneity significantly complicates both programming and the efficient use of resources. As a result, programmers typically assign a fixed role to each kind of device: CPUs for sequential and management tasks, GPUs for parallel work. This wastes computing power. Maat is a library for OpenCL programmers that enables the efficient execution of a single data-parallel kernel on all the available devices. It gives the programmer an abstract view of the system, so that heterogeneous environments can be managed regardless of the underlying architecture, together with a set of load balancing methods that distribute the data among the devices. With Maat, programmers only need to develop a data-parallel kernel, select a load balancing method, and run it on the whole system. Experimental results show that Maat uses all the resources efficiently, independently of their number and nature. Provided the most appropriate method is selected, Maat achieves a speedup of up to 1.97 on two GPUs with respect to a single GPU, and of over 2 when the CPUs, which are much slower, are also used.
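To make the idea of a load balancing method that distributes data more concrete, the following is a minimal sketch of one common approach: a static split of the kernel's index space in proportion to each device's relative speed. It is not Maat's API. The function name `static_partition`, its parameters, and the speed values in the usage example are all hypothetical and chosen only for illustration.

```python
def static_partition(total_items, relative_speeds, group_size=1):
    """Split `total_items` work-items into one contiguous chunk per device.

    Hypothetical helper, not part of Maat. Each chunk is roughly
    proportional to the device's entry in `relative_speeds`, and every
    chunk size is a multiple of `group_size` (the work-group size).
    Returns a list of (offset, size) pairs, one per device.
    """
    if total_items % group_size != 0:
        raise ValueError("total_items must be a multiple of group_size")
    if not relative_speeds or any(s < 0 for s in relative_speeds):
        raise ValueError("relative_speeds must be non-empty and non-negative")
    total_speed = sum(relative_speeds)
    if total_speed <= 0:
        raise ValueError("at least one device must have a positive speed")

    total_groups = total_items // group_size
    chunks = []
    offset = 0
    cumulative = 0.0
    for i, speed in enumerate(relative_speeds):
        cumulative += speed
        if i == len(relative_speeds) - 1:
            end = total_items  # the last device takes whatever remains
        else:
            # Round the cumulative boundary to a whole number of work-groups.
            end = round(total_groups * cumulative / total_speed) * group_size
        chunks.append((offset, end - offset))
        offset = end
    return chunks


# Assumed example: two equally fast GPUs and a CPU ten times slower.
for device, (offset, size) in zip(["gpu0", "gpu1", "cpu"],
                                  static_partition(1024, [1.0, 1.0, 0.1], 64)):
    print(device, offset, size)
```

In an OpenCL host program, each (offset, size) pair would typically become the global work offset and global work size of one kernel launch per device. A static split of this kind only pays off when the relative speeds are known in advance and the work per item is regular; otherwise a dynamic method that hands out chunks at run time is usually a better fit.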
