Evaluating the Performance of Integer Sum Reduction on an Intel GPU

Sum reduction is a primitive operation in parallel computing, and SYCL is a promising programming model for heterogeneous computing. In this paper, we describe SYCL implementations of integer sum reduction that use atomic functions, shared local memory, vectorized memory accesses, and parameterized workload sizes. Our evaluation of the reduction kernels shows that we can achieve a 1.4X speedup over open-source implementations of sum reduction for a sufficiently large number of integers on an Intel integrated GPU.
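To make the combination of techniques named above more concrete, the following is a minimal sketch of a two-level sum reduction. It is a CPU analogue written in standard C++, not the paper's SYCL kernels: each `std::thread` stands in for a work-group, a thread-private partial sum stands in for shared local memory, the inner fixed-width loop loosely mimics a vectorized load, and a single `fetch_add` per group stands in for the atomic function. The name `reduce_sum` and the parameters `num_groups` and `items_per_step` are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not from the paper): CPU analogue of a two-level
// GPU sum reduction. Assumption: one std::thread plays the role of one
// work-group, and its private `partial` plays the role of local memory.
std::int64_t reduce_sum(const std::vector<std::int32_t>& data,
                        std::size_t num_groups,
                        std::size_t items_per_step = 4) {
    assert(num_groups > 0 && items_per_step > 0);

    std::atomic<std::int64_t> total{0};
    const std::size_t n = data.size();
    // Parameterized workload size: elements assigned to each group (rounded up).
    const std::size_t chunk = (n + num_groups - 1) / num_groups;

    std::vector<std::thread> groups;
    groups.reserve(num_groups);
    for (std::size_t g = 0; g < num_groups; ++g) {
        groups.emplace_back([&, g] {
            const std::size_t begin = std::min(n, g * chunk);
            const std::size_t end = std::min(n, begin + chunk);
            std::int64_t partial = 0;
            std::size_t i = begin;
            // Consume items_per_step elements per iteration, loosely
            // mimicking a vectorized memory access.
            for (; i + items_per_step <= end; i += items_per_step) {
                for (std::size_t k = 0; k < items_per_step; ++k) {
                    partial += data[i + k];
                }
            }
            // Handle the leftover elements that do not fill a full step.
            for (; i < end; ++i) {
                partial += data[i];
            }
            // One atomic update per group instead of one per element.
            total.fetch_add(partial, std::memory_order_relaxed);
        });
    }
    for (auto& t : groups) {
        t.join();
    }
    return total.load();
}
```

Calling `reduce_sum(values, 8)` splits `values` across eight groups and returns the 64-bit sum. The design point the sketch illustrates is that accumulating locally before touching the shared counter reduces contention on the atomic; how much this helps on a real GPU depends on the hardware and on the chosen workload size.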
