GPUdrive: Reconsidering Storage Accesses for GPU Acceleration

GPU-accelerated data-intensive applications demonstrate speedups of more than ten-fold over CPU-only approaches. However, file-driven data movement between the CPU and the GPU can degrade performance and energy efficiency by an order of magnitude, owing to traditional storage latency and ineffective memory management. In this paper, we first analyze these two critical performance bottlenecks in GPU-accelerated data processing. We then study design considerations for reducing the overheads imposed by file-driven data movement in GPU computing. To address these issues, we prototype a low-cost, low-power all-flash array designed specifically for the stream-based, I/O-rich workloads inherent in GPU computing. Our preliminary evaluation demonstrates that our early-stage all-flash array can eliminate 60%∼90% of the performance discrepancy between memory-level GPU data transfer rates and storage access bandwidth by removing unnecessary data copies, memory management, and user/kernel-mode switching in the current system software stack. In addition, our all-flash array prototype consumes, on average, 49% less dynamic power than the baseline storage array.
