Kernel composition in SYCL

Parallel primitives libraries lower the barrier to entry for developers who want to write parallel applications and accelerate them with OpenCL. Unfortunately, some current libraries implement each primitive as an individual kernel, and so incur a high performance cost from off-chip memory operations on intermediate values. We describe a methodology for creating efficient domain-specific embedded languages on top of the SYCL for OpenCL standard for parallel programming. Using this approach, we developed a small example language that provides an environment for composing image processing pipelines from a library of primitive operations, while retaining the capability to generate a single kernel from a complex expression, thereby eliminating unnecessary intermediate loads and stores to global memory. This elimination of global memory accesses yields a 2.75x speedup over an equivalent unsharp mask implemented with OpenCLIPP. We give details of our domain-specific embedded language, and provide experimental performance measurements of both individual primitives and an unsharp mask operation composed of multiple primitives.
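The paper's language is embedded in SYCL, but the underlying fusion idea can be illustrated without a SYCL runtime. The following is a minimal expression-template sketch in plain C++ (all names hypothetical, not the paper's API): composing two element-wise image operations builds a lazy expression tree, and a single evaluation loop collapses the whole tree, so no intermediate image is ever written out, which is the analogue of emitting one fused kernel instead of one kernel per primitive.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of fusion via expression templates (not the paper's API).
// Each node exposes operator[](i); composing nodes builds a lazy expression
// tree instead of materialising intermediate images.

struct Image {
    std::vector<float> data;
    explicit Image(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Lazy element-wise gain (e.g. scaling an edge mask).
template <typename E>
struct Scale {
    const E& src;
    float gain;
    float operator[](std::size_t i) const { return src[i] * gain; }
    std::size_t size() const { return src.size(); }
};

// Lazy element-wise addition of two expressions.
template <typename A, typename B>
struct Add {
    const A& lhs;
    const B& rhs;
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

template <typename E>
Scale<E> scale(const E& e, float g) { return {e, g}; }

template <typename A, typename B>
Add<A, B> add(const A& a, const B& b) { return {a, b}; }

// Single evaluation pass: the whole expression tree is collapsed here,
// the analogue of generating one fused kernel for the full pipeline.
template <typename E>
Image evaluate(const E& expr) {
    Image out(expr.size());
    for (std::size_t i = 0; i < expr.size(); ++i)
        out.data[i] = expr[i];  // one fused traversal, no temporaries
    return out;
}
```

Because the expression nodes hold references, the whole pipeline should be built and evaluated in one full expression, e.g. `evaluate(add(img, scale(img, 0.5f)))`; in a GPU setting the same tree would instead be lowered to a single SYCL kernel body.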
