VectorPU: A Generic and Efficient Data-container and Component Model for Transparent Data Transfer on GPU-based Heterogeneous Systems

We present VectorPU, a C++-based programming framework providing high-level and efficient unified memory access on heterogeneous systems, in particular GPU-based systems. VectorPU consists of a lightweight runtime library providing a generic, "smart" data-container abstraction for transparent software caching of array operands with programmable memory coherence, and a lightweight component model realized by macro-based data access annotations. VectorPU thereby enables a flexible unified memory view in which data transfer and device memory management are abstracted away from the programmer, while retaining the efficiency of expert-written code with manual data movement and memory management. We provide a prototype of VectorPU for (CUDA) GPU-based systems and show, in experiments on several machines ranging from laptops to supercomputer nodes with Kepler and Maxwell GPUs, that it achieves 1.40× to 13.29× speedup over good-quality code using NVIDIA's Unified Memory. We also show the expressiveness and wide applicability of VectorPU, and its low overhead and efficiency on par with expert-written code.
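The core idea of such a "smart" data container can be illustrated with a short sketch. This is not the actual VectorPU API; it is a hypothetical illustration of lazy coherence tracking, where the container remembers which copy of the array is valid and copies only on a stale access. A second host buffer stands in for device memory so the sketch is self-contained; a real implementation would use cudaMalloc/cudaMemcpy behind the same interface.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a coherence-tracking "smart" container
// (names r/w/gr/gw and class SmartVector are illustrative, not VectorPU's API).
template <typename T>
class SmartVector {
    std::vector<T> host_;
    std::vector<T> device_;          // stand-in for GPU memory
    bool host_valid_ = true;
    bool device_valid_ = false;
public:
    explicit SmartVector(std::size_t n) : host_(n), device_(n) {}

    // Host read access: fetch from "device" only if the host copy is stale.
    const T* r() { sync_host(); return host_.data(); }

    // Host write access: host becomes the only valid copy.
    T* w() { host_valid_ = true; device_valid_ = false; return host_.data(); }

    // Device read access: copy host -> device only if needed.
    const T* gr() { sync_device(); return device_.data(); }

    // Device write access: device becomes the only valid copy.
    T* gw() { device_valid_ = true; host_valid_ = false; return device_.data(); }

    bool device_valid() const { return device_valid_; }
private:
    void sync_host()   { if (!host_valid_)   { host_ = device_; host_valid_ = true; } }
    void sync_device() { if (!device_valid_) { device_ = host_; device_valid_ = true; } }
};
```

Because each access is annotated with its intent (read, write, host side, device side), redundant transfers are skipped automatically: two consecutive device-side reads trigger at most one copy, which is the kind of transparent software caching the abstract describes.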
