Efficient Fork-Join on GPUs Through Warp Specialization

Graphics Processing Units (GPUs) are increasingly used to accelerate portions of general-purpose applications. Higher-level language extensions have been proposed to help non-experts bridge the gap between a host and the GPU's threading model. Recent updates to the OpenMP standard allow a user to parallelize code on a GPU using the well-known fork-join programming model familiar from CPUs. Mapping this model to the architecturally visible threading model of typical GPUs has been challenging. In this work we propose a novel approach based on the technique of warp specialization. We show how to specialize one warp (a unit of 32 GPU threads) to handle sequential code on a GPU. When this master warp reaches a user-specified parallel region, it awakens otherwise idle GPU warps to collectively execute the parallel code. Based on this method, we have implemented a Clang-based, OpenMP 4.5-compliant, open-source compiler for GPUs. Our work achieves a 3.6x (and up to 32x) performance improvement over a baseline that does not exploit fork-join parallelism on an NVIDIA K40m GPU across a set of 25 kernels. Compared to state-of-the-art compilers (Clang-ykt, GCC-OpenMP, GCC-OpenACC), our work is 2.1x to 7.6x faster. Our proposed technique is simpler to implement, robust, and performant.
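
To make the fork-join mapping concrete, the sketch below shows the general warp-specialization pattern in CUDA: the last warp of a thread block acts as the master and runs the sequential code, while the remaining warps wait at a block-wide barrier until the master publishes a parallel region. This is a minimal illustration of the pattern, not the paper's implementation; the names (run_team, WorkDesc) and the use of plain __syncthreads() for the fork and join points are assumptions made for brevity.

```
// Illustrative sketch only: run_team and WorkDesc are hypothetical names,
// not the paper's runtime API. Compile with nvcc as a .cu file.
#define WARP_SIZE 32

// Work descriptor the master warp uses to hand a parallel region to workers.
struct WorkDesc {
    int   state;   // 0 = idle, 1 = parallel region ready, -1 = finished
    int   n;       // trip count of the parallel loop
    float scale;   // value produced by the sequential part
    float *x;      // data touched by the parallel region
};

__global__ void run_team(float *x, int n) {
    __shared__ WorkDesc work;
    // The last warp of the block is specialized as the master.
    bool is_master = (threadIdx.x / WARP_SIZE) == (blockDim.x / WARP_SIZE) - 1;
    int  nworkers  = blockDim.x - WARP_SIZE;   // all non-master threads

    if (threadIdx.x == blockDim.x - WARP_SIZE) work.state = 0;
    __syncthreads();                            // (1) everyone sees the initial state

    if (is_master) {
        // ---- Sequential code runs on the master warp only ----
        if (threadIdx.x % WARP_SIZE == 0) {
            work.x     = x;
            work.n     = n;
            work.scale = 2.0f;                  // some sequential result
            work.state = 1;                     // publish the parallel region
        }
        __syncthreads();                        // (2) fork: release the workers
        __syncthreads();                        // (3) join: wait for the workers
        if (threadIdx.x % WARP_SIZE == 0) work.state = -1;
        __syncthreads();                        // (4) tell the workers to exit
    } else {
        // ---- Worker warps wait until the master publishes work ----
        for (;;) {
            __syncthreads();                    // matches (2), later (4)
            if (work.state == -1) break;        // kernel is done
            // Parallel region: workers split the iterations among themselves.
            for (int i = threadIdx.x; i < work.n; i += nworkers)
                work.x[i] *= work.scale;
            __syncthreads();                    // matches (3): join with the master
        }
    }
}
```

For example, launching run_team<<<blocks, 128 + WARP_SIZE>>>(x, n) would give each block 128 worker threads plus the 32-thread master warp. The sketch uses plain __syncthreads() for simplicity, which forces every warp to participate in every barrier; implementations of this scheme often rely on PTX named barriers (bar.sync with a barrier ID) so that only the warps involved in a fork or join need to synchronize.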
