Paper Information - Directive-Based Pipelining Extension for OpenMP

Directive-Based Pipelining Extension for OpenMP

Programming models such as CUDA, OpenMP, OpenACC, and OpenCL are designed to offload compute-intensive workloads to accelerators efficiently. However, the naive offload model, which copies data synchronously and then executes in sequence, requires extensive hand-tuning with techniques such as pipelining to overlap computation and communication. We therefore propose an easy-to-use, directive-based pipelining extension for OpenMP that overlaps data transfers with kernel computation. The extension can map data to a pre-allocated device buffer and can automate memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation of our approach with three different applications. The experimental results show that our approach can reduce memory usage by 52% to 97% while delivering a 1.41x to 1.65x speedup over the naive offload model.
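The abstract does not show the proposed directive syntax, so the following is only a minimal host-side sketch in plain C of the two ideas it names: streaming an array through a small pre-allocated buffer, and translating global array indices into positions in that buffer. All names here (`buffer_index`, `stage_in`, `pipelined_scale`) are hypothetical and are not taken from the paper. `memcpy` stands in for host-to-device and device-to-host transfers, and the inner loop stands in for the kernel. The sketch runs serially, so it shows the schedule and the indexing only; in a real runtime the transfer of chunk c+1 would run concurrently with the computation on chunk c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: map global element index i to its position in a
 * staging buffer made of num_slots slots, each holding chunk_len elements. */
size_t buffer_index(size_t i, size_t chunk_len, size_t num_slots)
{
    size_t chunk = i / chunk_len;
    return (chunk % num_slots) * chunk_len + (i % chunk_len);
}

/* Copy chunk c of the input into its buffer slot.
 * Stands in for a host-to-device transfer. Requires c * chunk_len < n. */
void stage_in(const double *in, size_t n, double *buf, size_t c,
              size_t chunk_len, size_t num_slots)
{
    size_t lo = c * chunk_len;
    size_t len = (n - lo < chunk_len) ? (n - lo) : chunk_len;
    memcpy(buf + buffer_index(lo, chunk_len, num_slots), in + lo,
           len * sizeof(double));
}

/* Multiply n doubles by factor while only ever holding
 * num_slots * chunk_len elements in buf. */
void pipelined_scale(const double *in, double *out, size_t n, double factor,
                     double *buf, size_t chunk_len, size_t num_slots)
{
    /* At least two slots are needed so that staging chunk c+1
     * never overwrites chunk c. */
    assert(chunk_len > 0 && num_slots >= 2);

    size_t num_chunks = (n + chunk_len - 1) / chunk_len;
    if (num_chunks == 0)
        return;

    /* Prologue: stage the first chunk before any computation. */
    stage_in(in, n, buf, 0, chunk_len, num_slots);

    for (size_t c = 0; c < num_chunks; ++c) {
        /* Stage the next chunk into a different slot. In a real pipeline
         * this transfer would overlap with the computation below. */
        if (c + 1 < num_chunks)
            stage_in(in, n, buf, c + 1, chunk_len, num_slots);

        size_t lo = c * chunk_len;
        size_t hi = (lo + chunk_len < n) ? (lo + chunk_len) : n;

        /* "Kernel": written with global indices, which are translated
         * into buffer positions. */
        for (size_t i = lo; i < hi; ++i)
            buf[buffer_index(i, chunk_len, num_slots)] *= factor;

        /* Copy the finished chunk back (stands in for device-to-host). */
        memcpy(out + lo, buf + buffer_index(lo, chunk_len, num_slots),
               (hi - lo) * sizeof(double));
    }
}
```

With `chunk_len = 3` and `num_slots = 2`, for instance, `buf` needs room for only 6 elements no matter how large `n` is, which is the kind of memory reduction the abstract refers to. The speedup reported in the abstract comes from overlapping transfers with computation, which this serial sketch does not attempt to reproduce.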
