Mastering Software Variant Explosion for GPU Accelerators

Mapping algorithms efficiently to the target hardware is a challenge for algorithm designers. This is particularly true for heterogeneous systems hosting accelerators such as graphics cards. While algorithm developers have profound knowledge of the application domain, they often lack the detailed insight into the underlying accelerator hardware that is needed to exploit the available processing power. Therefore, this paper introduces a rule-based, domain-specific optimization engine that generates the most appropriate code variant for different Graphics Processing Unit (GPU) accelerators. The optimization engine relies on knowledge fused from the application domain and the target architecture, and it is embedded into a framework that allows imaging algorithms to be designed in a Domain-Specific Language (DSL). We show that this makes it possible to keep one common description of an algorithm in the DSL and to select the optimal target code variant for different GPU accelerators and target languages such as CUDA and OpenCL.
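To make the idea of a rule-based variant selection more concrete, the following is a minimal sketch in Python, not the engine described in this paper. It assumes that knowledge about the target device and about the algorithm can each be captured in a small record, and that an ordered list of rules turns the two into a set of code-variant options. All names (`Device`, `Kernel`, `Variant`, `RULES`, `select_variant`), all device properties, and all thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Architecture knowledge (hypothetical fields)."""
    name: str
    backend: str             # target language: "cuda" or "opencl"
    has_texture_cache: bool
    shared_mem_kb: int


@dataclass(frozen=True)
class Kernel:
    """Domain knowledge about one imaging kernel (hypothetical fields)."""
    name: str
    window: int              # filter window width; 1 means a point operator


@dataclass
class Variant:
    """Options that describe one generated code variant."""
    backend: str
    use_texture: bool = False
    use_shared_tile: bool = False


# Each rule is (description, condition, action). Conditions see both the
# device and the kernel, so domain and architecture knowledge are combined.
# The rules and thresholds below are illustrative assumptions only.
RULES = [
    ("local operator on a device with a texture cache: read via textures",
     lambda d, k: k.window > 1 and d.has_texture_cache,
     lambda v: setattr(v, "use_texture", True)),
    ("local operator without texture cache: stage a tile in shared memory",
     lambda d, k: k.window > 1 and not d.has_texture_cache
                  and d.shared_mem_kb >= 16,
     lambda v: setattr(v, "use_shared_tile", True)),
]


def select_variant(device: Device, kernel: Kernel) -> Variant:
    """Apply every matching rule, in order, to a default variant."""
    variant = Variant(backend=device.backend)
    for _description, condition, action in RULES:
        if condition(device, kernel):
            action(variant)
    return variant


if __name__ == "__main__":
    blur = Kernel(name="blur", window=5)
    gpu_a = Device("hypothetical-A", "cuda", has_texture_cache=True, shared_mem_kb=48)
    gpu_b = Device("hypothetical-B", "opencl", has_texture_cache=False, shared_mem_kb=32)
    print(select_variant(gpu_a, blur))
    print(select_variant(gpu_b, blur))
```

In this sketch the same kernel description yields different variants for the two hypothetical devices, which mirrors the goal of keeping one common algorithm description while varying the generated target code. Keeping the rules as data rather than as nested conditionals is a design choice of the sketch: supporting a new device then means adding rules, not editing the selection logic.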
