Performance evaluation of OpenMP's target construct on GPUs - exploring compiler optimisations

OpenMP is a directive-based shared-memory parallel programming model that has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of the underlying GPU architecture. However, such high-level programming models generally shift additional optimisation responsibility onto compilers and runtime systems. Otherwise, OpenMP programs can be slower than fully hand-tuned, and even naive, implementations written in low-level programming models such as CUDA. To study the potential performance improvements from compiling and optimising high-level programs for GPU execution, in this paper we: 1) evaluate a set of OpenMP benchmarks on two NVIDIA Tesla GPUs (K80 and P100); and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU code generated automatically by the IBM XL and clang/LLVM compilers.
