OCLoptimizer: An Iterative Optimization Tool for OpenCL

Nowadays, computers include several computational devices with parallel capabilities, such as multicore processors and Graphics Processing Units (GPUs). OpenCL enables the programming of all these kinds of devices. An OpenCL program consists of a host code, which discovers the computational devices available in the host system and queues commands to them, and a kernel code, which defines the core of the parallel computation executed on the devices. This work addresses two of the most important problems faced by an OpenCL programmer: (1) host codes are quite verbose, but they can be generated automatically if some parameters are known; (2) OpenCL codes that are hand-optimized for a given device do not necessarily achieve good performance on a different one. This paper presents a source-to-source iterative optimization tool, called OCLoptimizer, that aims to generate host codes automatically and to optimize OpenCL kernels, taking as inputs an annotated version of the original kernel and a configuration file. Iterative optimization is a well-known technique that optimizes a given code by exploring different configuration parameters in a systematic manner. For example, we can apply tiling to one loop, and the iterative optimizer selects the optimal tile size by exploring the space of possible tile sizes. The experimental results show that the tool can automatically optimize a set of OpenCL kernels for multicore processors.
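
To illustrate why host codes are considered verbose, the following is a minimal sketch of the boilerplate a hand-written OpenCL host program typically needs before a single kernel can run: platform and device discovery, context and queue creation, program compilation, buffer management, and the kernel launch itself. The SAXPY kernel, the names and the sizes are illustrative only and are not taken from OCLoptimizer's generated output.

```c
/* Minimal OpenCL host-code sketch (illustrative, not OCLoptimizer output). */
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define CHECK(err) do { if ((err) != CL_SUCCESS) { \
    fprintf(stderr, "OpenCL error %d at line %d\n", (err), __LINE__); exit(1); } } while (0)

int main(void) {
    cl_int err;

    /* 1. Discover a platform and a device available in the host system. */
    cl_platform_id platform;
    CHECK(clGetPlatformIDs(1, &platform, NULL));
    cl_device_id device;
    CHECK(clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL));

    /* 2. Create a context and a command queue for that device. */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err); CHECK(err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err); CHECK(err);

    /* 3. Build the kernel from source (loading from a file is omitted). */
    const char *src =
        "__kernel void saxpy(float a, __global const float *x,"
        "                    __global float *y) {"
        "  int i = get_global_id(0); y[i] = a * x[i] + y[i]; }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err); CHECK(err);
    CHECK(clBuildProgram(prog, 1, &device, "", NULL, NULL));
    cl_kernel kernel = clCreateKernel(prog, "saxpy", &err); CHECK(err);

    /* 4. Allocate buffers, set arguments, enqueue the kernel and read back. */
    const size_t n = 1 << 20;
    float *x = malloc(n * sizeof(float)), *y = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), x, &err); CHECK(err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), y, &err); CHECK(err);
    float a = 2.0f;
    CHECK(clSetKernelArg(kernel, 0, sizeof(float), &a));
    CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), &dx));
    CHECK(clSetKernelArg(kernel, 2, sizeof(cl_mem), &dy));
    size_t global = n;
    CHECK(clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL));
    CHECK(clEnqueueReadBuffer(queue, dy, CL_TRUE, 0, n * sizeof(float), y, 0, NULL, NULL));

    /* Resource release is omitted for brevity. */
    return 0;
}
```

The tile-size example can likewise be sketched as an exhaustive iterative search: rebuild the tiled kernel for each candidate tile size (passed here as a -DTILE build option) and keep the fastest variant according to an OpenCL profiling event. The kernel name tiled_kernel, its argument list and the search range are hypothetical; the paper's actual search strategy and configuration parameters are not reproduced here.

```c
/* Hedged sketch of an exhaustive tile-size search over a hypothetical tiled kernel. */
#include <stdio.h>
#include <CL/cl.h>

size_t pick_best_tile(cl_context ctx, cl_command_queue profq, cl_device_id dev,
                      const char *tiled_src, cl_mem in, cl_mem out, size_t n) {
    cl_int err;
    size_t best_tile = 0;
    cl_ulong best_ns = (cl_ulong)-1;

    for (size_t tile = 4; tile <= 64; tile *= 2) {
        char options[64];
        snprintf(options, sizeof options, "-DTILE=%zu", tile);

        /* Rebuild the tiled kernel for this candidate tile size. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &tiled_src, NULL, &err);
        if (err != CL_SUCCESS) continue;
        if (clBuildProgram(prog, 1, &dev, options, NULL, NULL) != CL_SUCCESS) {
            clReleaseProgram(prog);
            continue;   /* e.g. the tile exceeds local memory: skip this candidate */
        }
        cl_kernel k = clCreateKernel(prog, "tiled_kernel", &err);
        clSetKernelArg(k, 0, sizeof(cl_mem), &in);
        clSetKernelArg(k, 1, sizeof(cl_mem), &out);

        /* Time one launch with a profiling event (profq must be created with
         * CL_QUEUE_PROFILING_ENABLE). */
        size_t global = n, local = tile;
        cl_event ev;
        if (clEnqueueNDRangeKernel(profq, k, 1, NULL, &global, &local,
                                   0, NULL, &ev) == CL_SUCCESS) {
            clWaitForEvents(1, &ev);
            cl_ulong start, end;
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof start, &start, NULL);
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof end, &end, NULL);
            cl_ulong ns = end - start;
            if (ns < best_ns) { best_ns = ns; best_tile = tile; }
            clReleaseEvent(ev);
        }
        clReleaseKernel(k);
        clReleaseProgram(prog);
    }
    return best_tile;
}
```

In practice a search of this kind would average several runs per candidate and, as the sketch does, simply discard configurations that fail to build or launch on the target device.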
