Achieving a single compute device image in OpenCL for multiple GPUs

In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform with multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes, at run time, an OpenCL kernel written for a single GPU across multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory that is allocated in main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains eight GPUs, using 11 OpenCL benchmark applications.
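As a concrete illustration (not taken from the paper), the following minimal host program is written against exactly one cl_device_id, which is the class of application the proposed framework targets: under a single virtual compute device image, this code would run unchanged while the runtime splits the NDRange across the underlying GPUs and keeps the virtual device memory consistent. All OpenCL calls are the standard 1.x host API; the vadd kernel, buffer sizes, and variable names are illustrative assumptions.

/* Single-device OpenCL vector addition; error checks and cleanup are
 * omitted for brevity.  The program sees ONE GPU device; a multi-GPU
 * runtime of the kind described above could transparently partition
 * the enqueued NDRange among several physical GPUs. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    const size_t n = 1 << 20;
    float *a = malloc(n * sizeof(float));
    float *b = malloc(n * sizeof(float));
    float *c = malloc(n * sizeof(float));
    for (size_t i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f; }

    cl_platform_id plat;
    cl_device_id dev;   /* exactly one device, as the application assumes */
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Buffers are allocated against the (virtual) device memory. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               n * sizeof(float), NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* A multi-GPU runtime would distribute this one-dimensional NDRange
     * across devices behind the single-device interface. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c,
                        0, NULL, NULL);

    printf("c[42] = %f\n", c[42]);
    return 0;
}

Note that nothing in this program names or counts physical GPUs; that property is what makes workload distribution a runtime decision rather than an application-level one.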
