Shared memory multiplexing: A novel way to improve GPGPU throughput

On-chip shared memory (a.k.a. local data share) is a critical resource for many GPGPU applications. In current GPUs, shared memory is allocated when a thread block (also called a workgroup) is dispatched to a streaming multiprocessor (SM) and released when the thread block completes. As a result, the limited shared memory capacity becomes a bottleneck that restricts the number of thread blocks an SM can host, limiting the otherwise available thread-level parallelism (TLP). In this paper, we propose software and hardware approaches to multiplex the shared memory among multiple thread blocks. Our proposed approaches are based on the observation that current shared memory management reserves shared memory too conservatively, for the entire lifetime of a thread block. If shared memory is allocated only when it is actually used and freed immediately afterwards, more thread blocks can be hosted in an SM without increasing the shared memory capacity. We propose three software approaches to enable shared memory multiplexing and implement them using a source-to-source compiler. The experimental results show that our proposed software approaches effectively improve the throughput of many GPGPU applications on both NVIDIA GTX285 and GTX480 GPUs (an average of 1.44X on GTX285, 1.70X on GTX480 with 16 kB shared memory, and 1.26X on GTX480 with 48 kB shared memory). We also propose hardware support for shared memory multiplexing, which requires only minor changes to existing hardware and delivers significant performance improvements (an average of 1.53X) with very little change to the GPGPU code.
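
A minimal CUDA sketch of the time-sharing idea behind the software approaches (not the paper's exact transformation; the kernel names, the BLK constant, and the squared-sum workload are illustrative assumptions): the baseline kernel reserves a shared buffer for its whole lifetime even though only a short reduction phase touches it, while the restructured kernel merges two logical thread blocks into one launched block that takes turns using a single buffer, roughly halving aggregate shared-memory demand.

#define BLK 256

// Baseline: each thread block reserves BLK floats of shared memory for its
// entire lifetime, although only the short reduction phase actually uses it.
__global__ void blockSumBaseline(const float *in, float *out, int n)
{
    __shared__ float buf[BLK];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Compute phase that does not touch shared memory (placeholder workload).
    float v = (gid < n) ? in[gid] * in[gid] : 0.0f;

    // Short shared-memory phase: tree reduction within the block.
    buf[threadIdx.x] = v;
    __syncthreads();
    for (int s = BLK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];
}

// Hypothetical multiplexed variant: two logical blocks are merged into one
// launched block of 2*BLK threads that time-share a single BLK-float buffer.
__global__ void blockSumMultiplexed(const float *in, float *out, int n)
{
    __shared__ float buf[BLK];
    int logical = threadIdx.x / BLK;   // which logical block this thread belongs to
    int lane    = threadIdx.x % BLK;   // thread index within its logical block
    int gid     = (blockIdx.x * 2 + logical) * BLK + lane;

    float v = (gid < n) ? in[gid] * in[gid] : 0.0f;

    // The two logical blocks use the shared buffer one after the other.
    for (int turn = 0; turn < 2; ++turn) {
        if (logical == turn) buf[lane] = v;
        __syncthreads();
        for (int s = BLK / 2; s > 0; s >>= 1) {
            if (logical == turn && lane < s) buf[lane] += buf[lane + s];
            __syncthreads();   // reached uniformly by all threads in the launched block
        }
        if (logical == turn && lane == 0) out[blockIdx.x * 2 + turn] = buf[0];
        __syncthreads();
    }
}

Under these assumptions, a launch of blockSumBaseline with N/BLK blocks of BLK threads would become blockSumMultiplexed with N/(2*BLK) blocks of 2*BLK threads (N assumed divisible by 2*BLK): each launched block now serves two logical blocks while reserving only one block's worth of shared memory, which matters when shared memory is the occupancy limiter.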
