Efficient parallel reduction on GPUs with Hipacc

Hipacc is a domain-specific language that simplifies programming image processing applications for hardware accelerators such as GPUs. It uses domain- and architecture-specific knowledge to relieve developers of the burden of manually porting algorithms to the target hardware. One fundamental operation in image processing is reduction. Global reduction operators are the building blocks of many widely used algorithms, such as image normalization and similarity estimation. This paper presents an efficient approach to performing parallel reductions on GPUs with Hipacc. The approach benefits from hardware vendors' continuous efforts to improve performance and programmability, for example by using the latest low-level primitives from Nvidia. Results show that the approach achieves a speedup of up to 3.43 over an existing Hipacc implementation that uses traditional optimization methods, and a speedup of up to 9.02 over an implementation using Nvidia's Thrust library.
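To make the notion of a global reduction concrete, the following is a minimal CPU-side sketch in Python of a tree-structured reduction and its use for image normalization. It is an illustration only, not Hipacc code and not the approach proposed in the paper: the names `tree_reduce` and `normalize` are hypothetical, the image is assumed to be a flat list of pixel values, and the operator is assumed to be associative and commutative. Each pass of the inner loop stands in for work that a GPU would distribute across parallel threads.

```python
from typing import Callable, List, Sequence


def tree_reduce(values: Sequence[float],
                op: Callable[[float, float], float]) -> float:
    """Reduce `values` with `op` in a pairwise, tree-like fashion.

    `op` is assumed to be associative and commutative (e.g. +, min, max).
    """
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    buf: List[float] = list(values)
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2
        # Each iteration of this loop is independent of the others, so on
        # a GPU every index i could be handled by a separate thread.
        for i in range(n // 2):
            buf[i] = op(buf[i], buf[i + half])
        n = half  # the active range shrinks by about half per step
    return buf[0]


def normalize(pixels: Sequence[float]) -> List[float]:
    """Scale pixel values to [0, 1] using two global reductions."""
    lo = tree_reduce(pixels, min)
    hi = tree_reduce(pixels, max)
    if hi == lo:
        # Constant image: avoid division by zero.
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]
```

For example, `tree_reduce([3, 1, 4, 1, 5], lambda a, b: a + b)` combines all five values in three steps rather than four sequential additions, which is the property that parallel hardware exploits.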
