Towards Transparently Tackling Functionality and Performance Issues across Different OpenCL Platforms

OpenCL applications may impose tight constraints on work-group size, owing to the algorithm design or the chosen implementation strategy. On platforms with fewer resources, such constraints can hamper functional or performance portability. The current solution is to redesign the implementation and optimize it for the new platform. This can become a showstopper for new platforms, where porting benchmark suites and applications requires a large manual optimization effort. In this work, we tackle these issues by applying work-item coalescing techniques to optimize the mapping of work-items to processing elements. Coalescing alone is generally not sufficient to achieve good performance, because different design patterns may be needed to exploit the specific features of the target architecture. We show how additional target-specific transformations improve performance over the work-item coalescing baseline. Using a Matrix Multiply case study, we show how work-item coalescing transformations affect functional portability and how they create an opportunity to automatically insert asynchronous copies on embedded many-core platforms that provide such a feature.
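As an illustration of the general idea only, work-item coalescing can be sketched in plain C: the body of a kernel is wrapped in loops over the local IDs, so that a single processing element executes every work-item of a work-group serially. The function names (`matmul_work_item`, `matmul_coalesced_group`, `matmul_coalesced`), the square row-major layout, and the barrier-free kernel body are assumptions made for this sketch; they are not taken from the paper, and a kernel containing barriers would additionally need its loops split at each barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical body of one work-item: computes one element C[row][col]
 * of the product of two n x n row-major matrices. */
static void matmul_work_item(const float *a, const float *b, float *c,
                             size_t n, size_t row, size_t col)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k)
        acc += a[row * n + k] * b[k * n + col];
    c[row * n + col] = acc;
}

/* Coalesced work-group: instead of wg x wg concurrent work-items,
 * one processing element iterates over all local IDs (ly, lx). */
static void matmul_coalesced_group(const float *a, const float *b, float *c,
                                   size_t n, size_t wg,
                                   size_t group_row, size_t group_col)
{
    for (size_t ly = 0; ly < wg; ++ly)
        for (size_t lx = 0; lx < wg; ++lx)
            matmul_work_item(a, b, c, n,
                             group_row * wg + ly,   /* global row */
                             group_col * wg + lx);  /* global column */
}

/* Runs every work-group of the n x n index space, one after another.
 * Assumes n is a multiple of the work-group edge wg. */
void matmul_coalesced(const float *a, const float *b, float *c,
                      size_t n, size_t wg)
{
    assert(wg > 0 && n % wg == 0);
    for (size_t gy = 0; gy < n / wg; ++gy)
        for (size_t gx = 0; gx < n / wg; ++gx)
            matmul_coalesced_group(a, b, c, n, wg, gy, gx);
}
```

Because the work-group size `wg` is now only a loop bound, the same code runs whether or not the target can host `wg * wg` concurrent work-items, which is the functional-portability aspect the abstract refers to.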
