Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System

In this paper, we present a data decomposition method for multi-dimensional data, aiming to realize multiple graphics processing unit (GPU) acceleration of a compute unified device architecture (CUDA) code written for a single GPU. Our multi-dimensional method extends a previous method that deals with one-dimensional (1-D) data. The method performs a sample run of selected GPU threads to decompose large data into small segments that avoid exhausting GPU memory. Compared with the previous method, our multi-dimensional method produces smaller segments, which saves GPU memory and reduces the amount of CPU-GPU data transfer. In experiments using matrix multiplication, the presented method consumed less GPU memory than the previous method and thereby successfully processed matrices 29 times larger, as long as the matrices fit into CPU memory. However, we found that the index transformation needed for multi-dimensional decomposition reduced the effective performance by 28%.
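
To make the idea concrete, below is a minimal CUDA sketch of two-dimensional decomposition for out-of-core matrix multiplication C = A x B. It is our illustration, not the authors' implementation: the paper's method derives the segments automatically from a sample run, whereas here the segmentation is written by hand. The host loops over T x T segments of C, streams the corresponding segments of A and B to the GPU, and the packTile/unpackTile helpers perform the global-to-local index transformation that the abstract identifies as the source of the 28% overhead. The names mmTile, packTile, and unpackTile and the sizes N and T are hypothetical and chosen so that only three segments reside on the GPU at a time.

    // Sketch: 2-D segmented (out-of-core) matrix multiplication C = A * B.
    // Only three T x T segments live on the GPU; error handling is omitted.
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <cuda_runtime.h>

    #define N 8192   // global matrix order (assumed too large to keep whole matrices on the GPU)
    #define T 1024   // segment edge length chosen so three T x T segments fit in GPU memory

    // The kernel works on one T x T segment; its indices are segment-local,
    // i.e., already transformed from the global index space by the host.
    __global__ void mmTile(const float *a, const float *b, float *c)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < T && col < T) {
            float sum = c[row * T + col];          // accumulate across K-segments
            for (int k = 0; k < T; ++k)
                sum += a[row * T + k] * b[k * T + col];
            c[row * T + col] = sum;
        }
    }

    // Copy a T x T segment out of (or back into) an N x N row-major matrix.
    // This is the host-side index transformation between global and
    // segment-local coordinates.
    static void packTile(float *dst, const float *src, int ti, int tj)
    {
        for (int r = 0; r < T; ++r)
            memcpy(dst + r * T, src + (ti * T + r) * N + tj * T, T * sizeof(float));
    }
    static void unpackTile(float *dst, const float *src, int ti, int tj)
    {
        for (int r = 0; r < T; ++r)
            memcpy(dst + (ti * T + r) * N + tj * T, src + r * T, T * sizeof(float));
    }

    int main(void)
    {
        size_t mat = (size_t)N * N * sizeof(float), seg = (size_t)T * T * sizeof(float);
        float *A = (float *)malloc(mat), *B = (float *)malloc(mat), *C = (float *)malloc(mat);
        float *hA = (float *)malloc(seg), *hB = (float *)malloc(seg), *hC = (float *)malloc(seg);
        for (size_t i = 0; i < (size_t)N * N; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

        float *dA, *dB, *dC;
        cudaMalloc(&dA, seg); cudaMalloc(&dB, seg); cudaMalloc(&dC, seg);

        dim3 blk(16, 16), grd(T / 16, T / 16);
        for (int i = 0; i < N / T; ++i)
            for (int j = 0; j < N / T; ++j) {
                cudaMemset(dC, 0, seg);            // reset the accumulator segment
                for (int k = 0; k < N / T; ++k) {
                    packTile(hA, A, i, k);
                    packTile(hB, B, k, j);
                    cudaMemcpy(dA, hA, seg, cudaMemcpyHostToDevice);
                    cudaMemcpy(dB, hB, seg, cudaMemcpyHostToDevice);
                    mmTile<<<grd, blk>>>(dA, dB, dC);
                }
                cudaMemcpy(hC, dC, seg, cudaMemcpyDeviceToHost);
                unpackTile(C, hC, i, j);
            }

        printf("C[0] = %f (expected %d)\n", C[0], N);
        free(A); free(B); free(C); free(hA); free(hB); free(hC);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }

This sketch processes the segments sequentially on one device; a multi-GPU variant would distribute the (i, j) segments of C across devices, for example by calling cudaSetDevice before each segment's transfers and kernel launch.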
