Paper information - Performance analysis of OpenMP on a GPU using a CORAL proxy application

OpenMP provides high-level parallel abstractions for programming heterogeneous systems based on acceleration technology. Active areas of research seek to characterize the performance that can be expected from even the simplest combinations of directives, and how that performance compares to versions manually implemented and tuned for a specific hardware accelerator. In this paper we analyze the performance of our implementation of the OpenMP 4.0 constructs on an NVIDIA GPU. For performance analysis we use LULESH, a complex proxy application provided by the Department of Energy as part of the CORAL benchmark suite. NVIDIA provides CUDA as a native programming model for GPUs. We compare the performance of an OpenMP 4.0 version of LULESH, derived from a pre-existing OpenMP implementation, with a functionally equivalent CUDA implementation. Alongside our performance analysis, we also present the tuning steps required to obtain good performance when porting existing applications to a new accelerator architecture. Based on the analysis of the performance characteristics of our application, we present an extension to the compiler code-synthesis process for combined OpenMP 4.0 offloading directives. The results obtained using our OpenMP compilation toolchain show a performance gap as small as 10% relative to native CUDA C/C++ for application kernels with low register counts.
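For readers unfamiliar with the term, a combined OpenMP 4.0 offloading directive of the kind the abstract mentions can be sketched as follows. This is a minimal illustration only: the function `saxpy_offload` and its loop are hypothetical and are not taken from LULESH or from the authors' toolchain. The pragma fuses `target`, `teams`, `distribute`, and `parallel for` into a single construct, which is the form a compiler could plausibly map onto one GPU kernel launch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel (not from LULESH): y[i] = a * x[i] + y[i].
 * The combined construct offloads the loop to the device, creates a
 * league of teams, distributes iterations across the teams, and
 * workshares each team's chunk among its threads. */
void saxpy_offload(int n, float a, const float *x, float *y)
{
    assert(n > 0 && x != NULL && y != NULL);

    /* map(to:) copies x to the device; map(tofrom:) copies y in and out. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Built without OpenMP support the pragma is ignored and the loop runs serially on the host, and with OpenMP enabled but no device available the region typically falls back to host execution, so the result should be the same in either case.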
