Exploring shared memory and cache to improve GPU performance and energy efficiency

Graphics Processing Units (GPUs) use multiple multithreaded SIMD cores to exploit data parallelism and boost performance. State-of-the-art GPUs provide a configurable split between shared memory and cache to improve performance for applications with different access patterns. Unlike CPU programs, GPU programs often exhibit access patterns whose performance does not depend heavily on cache access latency. On the other hand, shared memory capacity and other execution resources may become limiting factors on parallelism, which can significantly affect performance. In this paper, we evaluate the impact of different shared memory and cache configurations on both performance and energy consumption, providing insights that help GPU programmers use the configurable shared memory and cache more effectively.
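To make the configurability concrete: on GPUs where shared memory and L1 cache share the same on-chip SRAM, the split is selected through the CUDA runtime. The sketch below is illustrative only, not the paper's experimental setup; the kernel name, sizes, and computation are assumptions, and only the cache-configuration calls (`cudaFuncSetCacheConfig`, `cudaFuncCachePreferShared`) are real CUDA runtime API.

```cuda
// Hedged sketch: choosing the shared-memory/L1 split for one kernel.
// On Fermi/Kepler-class GPUs the same on-chip SRAM backs both structures,
// so preferring one shrinks the other. Kernel and data are placeholders.
#include <cuda_runtime.h>

__global__ void scale(const float *in, float *out, int n) {
    extern __shared__ float tile[];            // software-managed scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    if (i < n) tile[t] = in[i];                // stage data in shared memory
    __syncthreads();
    if (i < n) out[i] = tile[t] * 2.0f;        // placeholder computation
}

int main() {
    // Favor shared memory capacity over L1 cache for this kernel;
    // cudaFuncCachePreferL1 would request the opposite split, and the
    // best choice depends on the kernel's access pattern and occupancy.
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);
    // ... allocate device buffers, launch scale<<<blocks, threads,
    //     threads * sizeof(float)>>>(...), and measure as desired ...
    return 0;
}
```

A larger shared-memory allocation per block can reduce the number of concurrently resident blocks, which is exactly the parallelism-limiting effect the abstract describes.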
