GPUdrive: Reconsidering Storage Accesses for GPU Acceleration

GPU-accelerated data-intensive applications demonstrate speedups of more than ten-fold over CPU-only approaches. However, file-driven data movement between the CPU and the GPU can degrade performance and energy efficiency by an order of magnitude, owing to traditional storage latency and ineffective memory management. In this paper, we first analyze these two critical performance bottlenecks in GPU-accelerated data processing. We then study design considerations for reducing the overheads imposed by file-driven data movement in GPU computing. To address these issues, we prototype a low-cost, low-power all-flash array designed specifically for the stream-based, I/O-rich workloads inherent in GPU computing. Our preliminary evaluation demonstrates that our early-stage all-flash array can eliminate 60%∼90% of the performance discrepancy between memory-level GPU data transfer rates and storage access bandwidth by removing unnecessary data copies, memory management, and user/kernel-mode switching in the current system software stack. In addition, our all-flash array prototype consumes, on average, 49% less dynamic power than the baseline storage array.
