Efficient Fork-Join on GPUs Through Warp Specialization

Graphics Processing Units (GPUs) are increasingly used to accelerate portions of general-purpose applications. Higher-level language extensions have been proposed to help non-experts bridge the gap between a host and the GPU's threading model. Recent updates to the OpenMP standard allow a user to parallelize code on a GPU using the well-known fork-join programming model familiar from CPUs. Mapping this model to the architecturally visible threading model of typical GPUs has been challenging. In this work we propose a novel approach based on the technique of warp specialization. We show how to specialize one warp (a unit of 32 GPU threads) to handle sequential code on a GPU. When this master warp reaches a user-specified parallel region, it awakens otherwise idle GPU warps to collectively execute the parallel code. Based on this method, we have implemented a Clang-based, OpenMP 4.5-compliant, open-source compiler for GPUs. Our work achieves a 3.6x (and up to 32x) performance improvement over a baseline that does not exploit fork-join parallelism on an NVIDIA K40m GPU across a set of 25 kernels. Compared to state-of-the-art compilers (Clang-ykt, GCC-OpenMP, GCC-OpenACC), our work is 2.1x to 7.6x faster. Our proposed technique is simpler to implement, robust, and performant.
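
To make the fork-join mapping concrete, the sketch below shows the general warp-specialization pattern in CUDA: the last warp of a thread block acts as the master and runs the sequential code, while the remaining warps wait at a block-wide barrier until the master publishes a parallel region. This is a minimal illustration of the pattern, not the paper's implementation; the names (run_team, WorkDesc) and the use of plain __syncthreads() for the fork and join points are assumptions made for brevity.

```
// Illustrative sketch only: run_team and WorkDesc are hypothetical names,
// not the paper's runtime API. Compile with nvcc as a .cu file.
#define WARP_SIZE 32

// Work descriptor the master warp uses to hand a parallel region to workers.
struct WorkDesc {
    int   state;   // 0 = idle, 1 = parallel region ready, -1 = finished
    int   n;       // trip count of the parallel loop
    float scale;   // value produced by the sequential part
    float *x;      // data touched by the parallel region
};

__global__ void run_team(float *x, int n) {
    __shared__ WorkDesc work;
    // The last warp of the block is specialized as the master.
    bool is_master = (threadIdx.x / WARP_SIZE) == (blockDim.x / WARP_SIZE) - 1;
    int  nworkers  = blockDim.x - WARP_SIZE;   // all non-master threads

    if (threadIdx.x == blockDim.x - WARP_SIZE) work.state = 0;
    __syncthreads();                            // (1) everyone sees the initial state

    if (is_master) {
        // ---- Sequential code runs on the master warp only ----
        if (threadIdx.x % WARP_SIZE == 0) {
            work.x     = x;
            work.n     = n;
            work.scale = 2.0f;                  // some sequential result
            work.state = 1;                     // publish the parallel region
        }
        __syncthreads();                        // (2) fork: release the workers
        __syncthreads();                        // (3) join: wait for the workers
        if (threadIdx.x % WARP_SIZE == 0) work.state = -1;
        __syncthreads();                        // (4) tell the workers to exit
    } else {
        // ---- Worker warps wait until the master publishes work ----
        for (;;) {
            __syncthreads();                    // matches (2), later (4)
            if (work.state == -1) break;        // kernel is done
            // Parallel region: workers split the iterations among themselves.
            for (int i = threadIdx.x; i < work.n; i += nworkers)
                work.x[i] *= work.scale;
            __syncthreads();                    // matches (3): join with the master
        }
    }
}
```

For example, launching run_team<<<blocks, 128 + WARP_SIZE>>>(x, n) would give each block 128 worker threads plus the 32-thread master warp. The sketch uses plain __syncthreads() for simplicity, which forces every warp to participate in every barrier; implementations of this scheme often rely on PTX named barriers (bar.sync with a barrier ID) so that only the warps involved in a fork or join need to synchronize.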
