Directive-Based Pipelining Extension for OpenMP

Programming models such as CUDA, OpenMP, OpenACC, and OpenCL are designed to offload compute-intensive workloads to accelerators efficiently. However, the naive offload model, which synchronously copies data and executes kernels in sequence, requires extensive hand-tuning with techniques such as pipelining to overlap computation and communication. We therefore propose an easy-to-use, directive-based pipelining extension for OpenMP that overlaps data transfers with kernel computation. The extension can map data to a pre-allocated device buffer and automates memory-constrained array indexing and sub-task scheduling. We evaluate a prototype implementation of our approach on three different applications. The experimental results show that our approach reduces memory usage by 52% to 97% while delivering a 1.41X to 1.65X speedup over the naive offload model.
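The abstract does not show the proposed directive syntax, so as a point of reference the sketch below illustrates the hand-written pipelining that such an extension aims to automate, using only standard OpenMP 4.5 constructs (target enter/exit data, nowait, depend). All names here (scale, N, CHUNK, in, out) are illustrative, not taken from the paper.

```c
#include <omp.h>

#define N     (1 << 24)   /* total elements (illustrative) */
#define CHUNK (1 << 20)   /* pipeline granularity (illustrative) */

/* Process a large array in CHUNK-sized pieces. Each chunk's upload,
 * kernel, and download are asynchronous target tasks (nowait) chained
 * by depend clauses, so chunk i+1 can upload while chunk i computes
 * and chunk i-1 downloads. */
void scale(float *in, float *out)
{
    for (int c = 0; c < N; c += CHUNK) {
        /* stage this chunk's input on the device asynchronously */
        #pragma omp target enter data map(to: in[c:CHUNK]) \
                map(alloc: out[c:CHUNK]) depend(out: out[c]) nowait

        /* kernel on this chunk, ordered after its upload; the maps
         * find the data already present, so no extra transfer occurs */
        #pragma omp target teams distribute parallel for \
                map(to: in[c:CHUNK]) map(from: out[c:CHUNK]) \
                depend(inout: out[c]) nowait
        for (int i = c; i < c + CHUNK; i++)
            out[i] = 2.0f * in[i];

        /* download the result and free this chunk's device memory */
        #pragma omp target exit data map(from: out[c:CHUNK]) \
                map(release: in[c:CHUNK]) depend(inout: out[c]) nowait
    }
    #pragma omp taskwait   /* drain the pipeline */
}
```

The chunk size, the per-chunk index arithmetic, the dependence chains, and the reuse of a bounded device buffer are exactly the boilerplate the proposed extension is meant to generate from a single directive.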
