Static partitioning and mapping of kernel-based applications over modern heterogeneous architectures

Abstract: Heterogeneous architectures are used extensively to improve system processing capabilities. The critical functions of an application (kernels) can be mapped to different computing devices (e.g. CPUs, GPGPUs, accelerators) to maximize performance. However, the best performance can only be achieved if each kernel is mapped to the right device. Moreover, in some cases a kernel can be split and executed over several devices at the same time to maximize the use of compute resources on heterogeneous parallel architectures. In this paper, we define a static partitioning model based on profiling information from previous executions. The model takes a quantitative approach and computes the optimal match according to user-defined constraints. We evaluate it in two scenarios: single-kernel and multi-kernel applications. Experimental results show that our static partitioning model can increase the performance of parallel applications by deploying not only different kernels over different devices but also a single kernel over multiple devices. This avoids leaving compute resources idle on heterogeneous platforms and improves overall performance.
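To make the idea of splitting a single kernel over several devices concrete, the following is a minimal sketch and not the model defined in the paper. It assumes that profiling yields a constant per-item cost for each device, ignores data-transfer time and user-defined constraints, and divides the work items in proportion to each device's profiled throughput so that the estimated finish times are roughly equal. The function names, device labels, and cost values are hypothetical.

```python
def partition_kernel(total_items, profiled_cost_per_item):
    """Split total_items across devices in proportion to profiled throughput.

    profiled_cost_per_item maps a device name to the time one work item took
    in an earlier run (hypothetical profiling data, any consistent time unit).
    """
    # Throughput (items per time unit) is the inverse of the profiled cost.
    throughput = {dev: 1.0 / cost for dev, cost in profiled_cost_per_item.items()}
    total_throughput = sum(throughput.values())

    # Ideal fractional share for each device.
    ideal = {dev: total_items * t / total_throughput for dev, t in throughput.items()}

    # Round down, then hand the leftover items to the devices with the
    # largest fractional remainders so the shares sum to total_items.
    shares = {dev: int(x) for dev, x in ideal.items()}
    leftover = total_items - sum(shares.values())
    by_remainder = sorted(ideal, key=lambda d: ideal[d] - shares[d], reverse=True)
    for dev in by_remainder[:leftover]:
        shares[dev] += 1
    return shares


def estimated_makespan(shares, profiled_cost_per_item):
    """Estimated finish time: the slowest device determines the total."""
    return max(shares[dev] * profiled_cost_per_item[dev] for dev in shares)


if __name__ == "__main__":
    # Hypothetical profile: the GPU processes an item four times faster.
    costs = {"cpu": 4.0, "gpu": 1.0}
    shares = partition_kernel(1000, costs)
    print(shares, estimated_makespan(shares, costs))
```

Under these assumptions, a device that is idle in a single-device mapping takes a share of the work, and the estimated completion time falls below that of the fastest device running the whole kernel alone. A realistic model would also have to account for transfer costs and launch overheads, which this sketch leaves out.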
