Equidistant Memory Access Coalescing on GPGPU

With massive processing power, GPGPUs can execute thousands of threads in parallel, at the cost of high memory bandwidth to serve the large number of concurrent memory requests. To reduce this demand, GPGPUs employ memory access coalescing, which merges requests before they are issued to the memory system. In this paper, we first introduce the concept of memory access distance and classify GPGPU programs into three types according to it. We observe that programs with large but equal memory access distances are common on GPGPUs, yet such programs cannot be optimized by the original memory access coalescing. We therefore propose equidistant memory access coalescing, which can merge requests with any equal memory access distance. We evaluated our method on 30 benchmarks. Compared with the original memory access coalescing, equidistant memory access coalescing improves the performance of 19 of them. For the benchmarks with equal and large memory access distances, the average speedup is 151% and the maximum speedup is 200%, and the number of memory access requests is reduced to 32% of the original on average.
