Paper information - Coordinating GPU Threads for OpenMP 4.0 in LLVM

GPU devices are becoming critical building blocks of high-performance platforms for performance and energy-efficiency reasons. As a consequence, parallel programming environments such as OpenMP were extended to support offloading code to such devices. OpenMP compilers are faced with offering an efficient implementation of device-targeting constructs. One main issue in implementing OpenMP on a GPU is efficiently supporting both sequential and parallel regions, as GPUs are optimized only for executing highly parallel workloads. Multiple solutions to this issue were proposed in previous research. In this paper, we propose a method to coordinate threads in an NVIDIA GPU that is both efficient and easily integrated as part of a compiler. To support our claims, we developed CUDA programs that mimic multiple coordination schemes and we compare their performance. We show that a scheme based on dynamic parallelism performs poorly compared to the inspector-executor schemes that we introduce in this paper. We also discuss how to integrate these schemes into the LLVM compiler infrastructure.
