Performance evaluation of OpenMP's target construct on GPUs - exploring compiler optimisations

OpenMP is a directive-based shared-memory parallel programming model that has been widely used for many years. From OpenMP 4.0 onwards, GPU platforms are supported by extending OpenMP's high-level parallel abstractions with accelerator programming. This extension allows programmers to write GPU programs in standard C/C++ or Fortran, without exposing too many details of the underlying GPU architecture. However, such high-level programming models generally shift additional optimisation responsibility onto compilers and runtime systems. Otherwise, OpenMP programs can be slower than fully hand-tuned, and even naive, implementations written in low-level programming models such as CUDA. To study the potential performance improvements from compiling and optimising high-level programs for GPU execution, in this paper we: 1) evaluate a set of OpenMP benchmarks on two NVIDIA Tesla GPUs (K80 and P100); and 2) conduct a comparative performance analysis of hand-written CUDA programs and GPU code generated automatically by the IBM XL and clang/LLVM compilers.
