A Case Study on the HACCmk Routine in SYCL on Integrated Graphics

Unlike the Open Computing Language (OpenCL) programming model, in which host and device code are generally written in different languages, the SYCL programming model can combine host and device code for an application in a type-safe way to improve development productivity. In this paper, we chose the HACCmk routine, a representative compute-bound kernel, as a case study on the performance of the SYCL programming model targeting a heterogeneous computing device. More specifically, we introduced the SYCL programming model, presented the OpenCL and SYCL implementations of the routine, and compared the performance of the two implementations using offline and online compilation on Intel® Iris™ Pro integrated GPUs. We found that the overhead of online compilation may become significant compared to the execution time of a kernel. Compared to the OpenCL implementations, the SYCL implementation can maintain performance when using offline compilation. The number of execution units in a GPU is critical to improving the raw performance of a compute-bound kernel.
