An experimental study on performance portability of OpenCL kernels

Accelerator processors allow energy-efficient computation at high performance, especially for computationintensive applications. There exists a plethora of different accelerator architectures, such as GPUs and the Cell Broadband Engine. Each accelerator has its own programming language, but the recently introduced OpenCL language unifies accelerator programming languages. Hereby, OpenCL achieves functional protability, allowing to reduce the development time of kernels. Functional portability however has limited value without performance portability: the possibility to re-use optimized kernels with good performance. This paper investigates the specificity of code optimizations to accelerator architecture and the severity of lack of performance portability.

[1]  Richard W. Vuduc,et al.  Sparsity: Optimization Framework for Sparse Matrix Kernels , 2004, Int. J. High Perform. Comput. Appl..

[2]  H. Peter Hofstee,et al.  Power efficient processor architecture and the cell processor , 2005, 11th International Symposium on High-Performance Computer Architecture.

[3]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[4]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[5]  Samuel Williams,et al.  Auto-tuning performance on multicore computers , 2008 .