Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators Based on a Domain-Specific Language for Medical Imaging

Efficient utilization of memory bandwidth is crucial for memory-bound applications on GPU accelerators. In medical imaging, the performance of many kernels is limited by the available memory bandwidth since only a few operations are performed per pixel. For such kernels, only a fraction of the compute power provided by GPU accelerators can be exploited, and performance is dictated by memory bandwidth. As a remedy, this paper investigates how the available memory bandwidth can be utilized more effectively by increasing the number of in-flight memory transactions. Instead of applying this optimization manually for each GPU accelerator, the required CUDA and OpenCL code is generated automatically from descriptions in a Domain-Specific Language (DSL) for the considered application domain. Moreover, the DSL is extended to also support global reduction operators. We show that the generated target-specific code improves bandwidth utilization for memory-bound kernels significantly. Furthermore, competitive performance compared to the GPU back end of the widely used image processing library OpenCV can be achieved.
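To illustrate the underlying idea, the following is a minimal CUDA sketch (not taken from the paper; the kernel name, the per-thread tile size PPT, and the scaling operation are purely illustrative): each thread issues several independent, coalesced loads before consuming them, so that more memory transactions are in flight concurrently and the memory bus is kept busier.

```cuda
// Hypothetical sketch of increasing in-flight memory transactions.
// PPT and scale_kernel are illustrative names, not from the paper.
#define PPT 4  // pixels processed per thread (assumed, tunable value)

__global__ void scale_kernel(const float *in, float *out, float alpha, int n) {
    int base = blockIdx.x * blockDim.x * PPT + threadIdx.x;
    float v[PPT];

    // Issue all loads first; independent loads can overlap in flight.
    #pragma unroll
    for (int i = 0; i < PPT; ++i) {
        int idx = base + i * blockDim.x;   // stride keeps warp accesses coalesced
        v[i] = (idx < n) ? in[idx] : 0.0f;
    }

    // Consume the loaded values and write the results back.
    #pragma unroll
    for (int i = 0; i < PPT; ++i) {
        int idx = base + i * blockDim.x;
        if (idx < n) out[idx] = alpha * v[i];
    }
}
```

Compared to a one-pixel-per-thread kernel, such a layout exposes more memory-level parallelism per thread; the optimal amount of work per thread depends on the target GPU, which is why generating it automatically per target is attractive.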