Customization of OpenCL applications for efficient task mapping under heterogeneous platform constraints

When an OpenCL application targets a platform with multiple heterogeneous accelerators, task tuning and mapping have to cope with device-specific constraints. To address this problem, we present a design flow for the customization and performance optimization of OpenCL applications on heterogeneous parallel platforms. It consists of two phases: 1) a tuning phase that optimizes each application kernel for a given platform and 2) a task-mapping phase that maximizes the overall application throughput by exploiting concurrency in the application task graph. The tuning phase customizes parameterized OpenCL kernels under device-specific constraints. The mapping phase then improves task-level parallelism for multi-device execution while accounting for the memory-transfer overheads implied by multiple OpenCL contexts for different device vendors. The benefits of the proposed design flow have been assessed on a stereo-matching application targeting two commercial heterogeneous platforms.
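As a rough illustration of what a throughput-oriented task-mapping phase has to decide, the following minimal sketch enumerates every task-to-device assignment for a tiny task graph and keeps the one with the shortest steady-state period. It is not the authors' formulation: the task names, device names, timings, transfer costs, and the cost model (each cross-device edge charged to the receiving device) are all hypothetical assumptions chosen only to make the trade-off visible.

```python
from itertools import product

# Hypothetical per-task execution times (ms) on each device,
# standing in for the output of a per-kernel tuning phase.
EXEC_MS = {
    "load":   {"cpu": 2.0,  "gpu": 3.0},
    "match":  {"cpu": 20.0, "gpu": 4.0},
    "refine": {"cpu": 6.0,  "gpu": 5.0},
}

# Hypothetical task-graph edges with the transfer cost (ms) paid
# only when producer and consumer run on different devices.
EDGES = {("load", "match"): 1.5, ("match", "refine"): 1.0}


def period(mapping):
    """Steady-state period of a pipelined run: the busiest device's load."""
    load = {}
    for task, dev in mapping.items():
        load[dev] = load.get(dev, 0.0) + EXEC_MS[task][dev]
    for (src, dst), cost in EDGES.items():
        if mapping[src] != mapping[dst]:
            # Assumption: the receiving device pays for the transfer.
            load[mapping[dst]] += cost
    return max(load.values())


def best_mapping(devices=("cpu", "gpu")):
    """Exhaustively search all assignments and return the shortest-period one."""
    tasks = list(EXEC_MS)
    candidates = (
        dict(zip(tasks, combo))
        for combo in product(devices, repeat=len(tasks))
    )
    return min(candidates, key=period)


if __name__ == "__main__":
    mapping = best_mapping()
    print(mapping, period(mapping))
```

With these made-up numbers, running everything on the faster device is not the best choice: moving the last task to the other device shortens the period even after paying for the transfer. Exhaustive search is only viable for toy graphs; a realistic flow would need a proper optimization formulation.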
