SemCache++: Semantics-Aware Caching for Efficient Multi-GPU Offloading

Offloading computations to multiple GPUs is not an easy task: it requires manually decomposing data, distributing computations, and handling communication. GPU drop-in libraries, which require no program rewriting, hide this complexity inside library calls, making multi-GPU offloading straightforward. That encapsulation, however, prevents data from being reused across successive kernel invocations, resulting in redundant communication; this limitation exists in multi-GPU libraries such as CUBLASXT. In this paper, we introduce SemCache++, a semantics-aware GPU cache that automatically manages communication between the CPU and multiple GPUs and further optimizes it by using caching to eliminate redundant transfers. SemCache++ is used to build the first multi-GPU drop-in replacement library that (a) uses virtual memory to automatically manage and optimize multi-GPU communication and (b) requires no program rewriting or annotations. Our caching technique is efficient: it uses a two-level caching directory to track both matrices and sub-matrices. Experimental results show that our system eliminates redundant communication and delivers performance improvements over multi-GPU libraries such as StarPU and CUBLASXT.
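The core idea above can be illustrated with a minimal sketch of a two-level caching directory: level one tracks whole matrices, level two tracks which sub-matrix tiles each GPU already holds, so a drop-in wrapper can transfer only the tiles a kernel call is missing. All names and the tile representation are illustrative assumptions for this sketch, not SemCache++'s actual API:

```python
class CacheDirectory:
    """Hypothetical two-level directory: matrix id -> {tile -> GPUs holding a valid copy}."""

    def __init__(self, num_gpus):
        self.num_gpus = num_gpus
        self.matrices = {}  # level 1: per-matrix entries; level 2: per-tile residency sets

    def tiles_to_transfer(self, matrix_id, tiles, gpu):
        """Return the subset of `tiles` not yet cached on `gpu`, and record their transfer."""
        entry = self.matrices.setdefault(matrix_id, {})
        missing = [t for t in tiles if gpu not in entry.get(t, set())]
        for t in missing:
            entry.setdefault(t, set()).add(gpu)  # mark tile as resident after the (simulated) copy
        return missing

    def invalidate(self, matrix_id):
        """A CPU write was detected (e.g. via virtual-memory protection): drop all GPU copies."""
        self.matrices.pop(matrix_id, None)


# Usage: a second kernel call on the same tiles incurs no transfers.
d = CacheDirectory(num_gpus=2)
first = d.tiles_to_transfer("A", [(0, 0), (0, 1)], gpu=0)   # both tiles must be sent
second = d.tiles_to_transfer("A", [(0, 0), (0, 1)], gpu=0)  # redundant transfer eliminated
```

In a real drop-in library, `invalidate` would be triggered transparently by page-protection faults on the host buffer rather than by an explicit call, which is what lets the library work without program rewriting or annotations.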
