SemCache: semantics-aware caching for efficient GPU offloading

Recently, GPU libraries have made it easy to improve application performance by offloading computation to the GPU. However, using such libraries introduces the complexity of manually managing explicit data movement between GPU and CPU memory spaces. When these libraries are used in complex applications, it is very difficult to optimize CPU-GPU communication across multiple kernel invocations and avoid redundant transfers. In this paper, we introduce SemCache, a semantics-aware GPU cache that automatically manages CPU-GPU communication and dynamically eliminates redundant transfers through caching. Its key feature is the use of library semantics to determine the appropriate caching granularity for a given offloaded library (e.g., matrices in BLAS). We applied SemCache to BLAS libraries to produce a drop-in GPU replacement library that handles communication and its optimization automatically. Our caching technique is efficient: it tracks whole matrices rather than every memory access at fine granularity. Experimental results show that our system dramatically reduces redundant communication in real-world computational science applications and delivers significant performance improvements, outperforming GPU-based implementations such as CULA and CUBLAS.
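The core idea of matrix-granularity caching can be sketched as follows. This is a minimal illustrative model, not the paper's actual implementation: the names (`MatrixCache`, `gemm`, `ensure_on_gpu`) are hypothetical, and the GPU transfer is simulated by a counter standing in for a host-to-device copy.

```python
# Hypothetical sketch of semantics-aware, matrix-granularity caching.
# A real system would track host pointers and issue cudaMemcpy calls;
# here a counter stands in for the actual transfer.

class MatrixCache:
    """Tracks which matrices currently hold a valid copy on the GPU,
    at whole-matrix granularity, to skip redundant transfers."""
    def __init__(self):
        self.on_gpu = set()   # ids of matrices valid in GPU memory
        self.transfers = 0    # count of host->device copies performed

    def ensure_on_gpu(self, matrix_id):
        # Transfer only if the matrix is not already cached on the GPU.
        if matrix_id not in self.on_gpu:
            self.transfers += 1          # stands in for a device copy
            self.on_gpu.add(matrix_id)

    def invalidate(self, matrix_id):
        # Called when the CPU writes the matrix: the GPU copy is stale.
        self.on_gpu.discard(matrix_id)

def gemm(cache, a_id, b_id, c_id):
    """Drop-in BLAS-style wrapper: stage operands, run the kernel,
    then mark the output as resident on the GPU."""
    cache.ensure_on_gpu(a_id)
    cache.ensure_on_gpu(b_id)
    # ... GPU kernel launch would go here ...
    cache.on_gpu.add(c_id)   # result is now valid in GPU memory

cache = MatrixCache()
gemm(cache, "A", "B", "C")   # A and B copied to the GPU: 2 transfers
gemm(cache, "A", "C", "D")   # A and C already resident: 0 new transfers
```

Because the cache reasons about whole matrices named by the library interface rather than individual memory accesses, the second call reuses both operands without any communication, which is the kind of redundancy the abstract describes eliminating.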
