Paper information - Software-level scheduling to exploit non-uniformly shared data cache on GPGPU

Software-level scheduling to exploit non-uniformly shared data cache on GPGPU

Data caches were introduced to GPUs to mitigate the irregular memory access problem, but few studies have investigated how to exploit their full potential. In this work, we consider some important GPU applications that feature data sharing across thread blocks. We show that this sharing is not well exploited because the current GPU runtime ignores it when scheduling threads. We then present an application-level transformation that remaps thread blocks to data on the fly. With this software-level scheduler, thread blocks that share much data are scheduled onto the same streaming multiprocessor (SM) so that they share its cache. Experiments on four benchmarks show a 1.23X speedup on average.
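The idea of co-locating thread blocks that share data can be illustrated with a small host-side sketch. It is not the paper's algorithm: the greedy heuristic, the function names `group_blocks_by_sharing` and `distinct_items_per_sm`, and the toy access pattern are all assumptions made for illustration. Each logical thread block is described by the data items it touches, and blocks are placed into per-SM queues so that blocks with overlapping data end up on the same SM.

```python
def group_blocks_by_sharing(block_data, num_sms):
    """Greedily place each logical thread block in the SM queue whose
    already-assigned blocks touch the most of the same data items.
    (Hypothetical heuristic, used here only as a stand-in.)"""
    capacity = -(-len(block_data) // num_sms)  # ceiling division keeps queues balanced
    queues = [[] for _ in range(num_sms)]
    cached = [set() for _ in range(num_sms)]   # data items touched so far on each SM
    for block_id, items in enumerate(block_data):
        items = set(items)
        candidates = [sm for sm in range(num_sms) if len(queues[sm]) < capacity]
        # Prefer the largest overlap; break ties toward the emptier queue,
        # then toward the lowest SM index.
        best = max(
            candidates,
            key=lambda sm: (len(cached[sm] & items), -len(queues[sm]), -sm),
        )
        queues[best].append(block_id)
        cached[best] |= items
    return queues


def distinct_items_per_sm(block_data, queues):
    """Count the distinct data items each SM's cache would have to hold."""
    return [
        len(set().union(*(block_data[b] for b in q))) if q else 0
        for q in queues
    ]


# Toy access pattern: blocks 0 and 1 share item 1, blocks 2 and 3 share item 5.
block_data = [[0, 1], [1, 2], [4, 5], [5, 6]]
round_robin = [[0, 2], [1, 3]]  # a sharing-oblivious placement, for comparison
grouped = group_blocks_by_sharing(block_data, num_sms=2)
print(grouped, distinct_items_per_sm(block_data, grouped))
print(round_robin, distinct_items_per_sm(block_data, round_robin))
```

In this toy case the sharing-aware placement leaves each SM with fewer distinct items to cache than the sharing-oblivious one. On a real GPU the resulting queues would have to be consumed by the kernel itself, since the hardware scheduler does not take such placement hints.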

Bo Wu | Xipeng Shen | Weilin Wang

[1] Bo Wu, et al. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU, 2013, PPoPP '13.

[2] Bo Wu, et al. Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[3] Margaret Martonosi, et al. Characterizing and improving the use of demand-fetched caches in GPUs, 2012, ICS '12.

[4] Xipeng Shen, et al. On-the-fly elimination of dynamic irregularities for GPU computing, 2011, ASPLOS XVI.

[5] Xipeng Shen, et al. Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?, 2010, PPoPP '10.