Tag-Split Cache for Efficient GPGPU Cache Utilization

Modern GPUs employ caches to improve memory system efficiency. However, a large amount of cache space is underutilized because of the irregular memory accesses and poor spatial locality commonly exhibited by GPU applications. Our experiments show that using smaller cache lines can improve cache space utilization, but it frequently causes significant performance loss by introducing a large number of extra cache requests. In this work, we propose a novel cache design named tag-split cache (TSC) that enables fine-grained cache storage to address cache space underutilization while keeping the number of memory requests unchanged. TSC divides the tag into two parts to reduce storage overhead, and it supports replacing multiple cache lines in one cycle. TSC can also automatically adjust the cache storage granularity to avoid performance loss for applications with good spatial locality. Our evaluation shows that TSC improves performance over the baseline cache by 17.2% on average across a wide range of applications. It also significantly outperforms previous techniques.
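The trade-off between cache line size, space utilization, and request count can be illustrated with a small sketch. This is not the TSC design or the paper's experimental setup; the 128-byte and 32-byte line sizes, the 4-byte word size, the two access patterns, and the helper name `line_stats` are all assumptions chosen for illustration.

```python
def line_stats(addresses, line_size, word_size=4):
    """Return (requests, utilization) for one warp-wide access.

    requests    : number of distinct cache lines touched
    utilization : fraction of the fetched bytes that are actually used
    """
    lines = {addr // line_size for addr in addresses}
    used_bytes = len({addr // word_size for addr in addresses}) * word_size
    fetched_bytes = len(lines) * line_size
    return len(lines), used_bytes / fetched_bytes


# Hypothetical strided pattern: 32 threads, one 4-byte word each, 64 bytes apart.
strided = [tid * 64 for tid in range(32)]
# Hypothetical coalesced pattern: 32 threads reading consecutive 4-byte words.
coalesced = [tid * 4 for tid in range(32)]

for name, addrs in (("strided", strided), ("coalesced", coalesced)):
    for line_size in (128, 32):  # assumed line sizes, in bytes
        reqs, util = line_stats(addrs, line_size)
        print(f"{name:9s} line={line_size:3d}B requests={reqs:2d} utilization={util:.4f}")
```

Under these assumptions, shrinking the line from 128 to 32 bytes doubles utilization for the strided pattern (0.0625 to 0.125) but also doubles the request count (16 to 32). For the coalesced pattern, utilization is already 1.0 and the smaller line only raises the request count from 1 to 4. This is the tension the abstract describes: finer storage granularity helps space utilization, while the number of requests is better left unchanged.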
