DyCache: Dynamic Multi-Grain Cache Management for Irregular Memory Accesses on GPU

GPUs use wide (128B) cache lines in their on-chip caches to provide high bandwidth and efficient memory access for applications with regularly organized data structures. However, emerging applications exhibit many irregular control flows and memory access patterns. Irregular memory accesses generate many fine-grain requests to the L1 data cache. This mismatch between fine-grain data accesses and the coarse-grain cache design makes the on-chip memory space more constrained; as a result, the frequency of cache-line replacement increases and the L1 data cache is used inefficiently. Fine-grain cache management has been proposed to improve the utilization of the cache data array. Unlike prior static fine-grain schemes, we propose a dynamic multi-grain cache management scheme, called DyCache, to resolve the inefficient use of the L1 data cache. By monitoring an application's memory access pattern, DyCache dynamically adjusts the cache management granularity, improving GPU performance for applications with irregular memory accesses while not affecting the performance of regular applications. Our experiments show that DyCache achieves a 40% geometric-mean IPC improvement over the baseline 128B cache for applications with irregular memory accesses, while it does not degrade performance for applications with regular memory accesses.
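To make the idea of dynamically choosing a management granularity concrete, the following is a minimal C++ sketch of a sampling-based monitor, not the paper's actual hardware design: it counts how many L1 accesses touch only a narrow portion of a 128B line within a sampling window and switches between a fine (32B) and coarse (128B) mode with hysteresis. The sector size, window length, and thresholds are illustrative assumptions.

```cpp
// Sketch of a dynamic-granularity monitor (illustrative; parameters are assumptions).
#include <cstdint>
#include <cstdio>

enum class Granularity { Fine32B, Coarse128B };

class GranularityMonitor {
public:
    // Record one L1D access; 'bytesUsed' is the number of bytes the warp's
    // coalesced request actually touches within the 128B line.
    void recordAccess(uint32_t bytesUsed) {
        ++total_;
        if (bytesUsed <= 32) ++fine_;            // count as a fine-grain access
        if (total_ == kWindow) decide();
    }

    Granularity current() const { return mode_; }

private:
    void decide() {
        double fineRatio = static_cast<double>(fine_) / total_;
        // Hysteresis: switch to fine-grain management only when most accesses
        // are narrow, and back to coarse-grain only when most are wide.
        if (fineRatio > 0.75)      mode_ = Granularity::Fine32B;
        else if (fineRatio < 0.25) mode_ = Granularity::Coarse128B;
        fine_ = total_ = 0;                      // start a new sampling window
    }

    static constexpr uint32_t kWindow = 4096;    // accesses per sampling window
    uint32_t fine_ = 0, total_ = 0;
    Granularity mode_ = Granularity::Coarse128B; // default: baseline 128B lines
};

int main() {
    GranularityMonitor mon;
    // Simulated irregular workload: most requests touch only a few bytes.
    for (int i = 0; i < 10000; ++i) mon.recordAccess(i % 8 ? 8 : 128);
    std::printf("mode = %s\n",
                mon.current() == Granularity::Fine32B ? "32B" : "128B");
}
```

In a real design such a monitor would sit beside the L1 tag pipeline and drive allocation decisions per line or per sector; the point of the sketch is only the feedback loop between observed access width and the chosen granularity.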
