Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

To support the massive number of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming increasingly complex, and the Last Level Cache (LLC) grows considerably with each GPU generation. This paper shows that, counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size scales in neither performance nor energy consumption. We examine how LLC misses are managed in typical GPUs and find that, in most cases, the way LLC misses are handled is precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information for incoming blocks until they are fetched from main memory. The fetched blocks are then swapped with the victim blocks (i.e., those selected for replacement) in the LLC, and the eviction of the victim blocks is performed from the FRC. This approach improves performance for three main reasons: i) the lifetime of blocks being replaced is extended, ii) the main memory path is unclogged during long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio and memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and much smaller area requirements. Experimental results show that the proposed FRC scales in performance with the number of GPU compute units and the LLC size: depending on the FRC size, performance improves by 30 to 67 percent for a modern baseline GPU card, and by 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced by 49 to 57 percent on average for the larger GPU. These benefits come with a small area increase (7.3 percent) over the LLC baseline.
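The miss-handling flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class, method names, and simplified LRU/FRC structures are all illustrative assumptions that only capture the key idea, namely that the LLC victim is not evicted at miss time but remains resident until the fetched block returns, at which point the two are swapped and the eviction completes from the FRC side, off the critical path.

```python
# Illustrative sketch of FRC-style miss handling (not the paper's actual design).
# FRCSketch, on_llc_miss, and on_fetch_complete are hypothetical names.

class FRCSketch:
    def __init__(self, llc_ways=4, frc_entries=2):
        self.llc = {}              # set_index -> list of block tags, LRU order
        self.llc_ways = llc_ways
        self.frc = []              # in-flight entries: (tag, set_index)
        self.frc_entries = frc_entries

    def on_llc_miss(self, tag, set_index):
        """On an LLC miss, allocate an FRC entry for the incoming block
        instead of evicting an LLC victim immediately."""
        if len(self.frc) >= self.frc_entries:
            return False           # FRC full: the miss must stall
        self.frc.append((tag, set_index))
        return True                # fetch to main memory proceeds

    def on_fetch_complete(self, tag, set_index):
        """When the fetch returns, install the block in the LLC; only now
        is a victim selected, so its LLC lifetime is extended. Its eviction
        (write-back if dirty) then completes from the FRC side."""
        self.frc.remove((tag, set_index))
        ways = self.llc.setdefault(set_index, [])
        victim = ways.pop(0) if len(ways) >= self.llc_ways else None
        ways.append(tag)           # fetched block swapped into the LLC
        return victim              # victim block, now evicted via the FRC
```

In this sketch, a victim block can still serve hits between `on_llc_miss` and `on_fetch_complete`, which corresponds to reason i) above; decoupling eviction from the fetch path loosely corresponds to reasons ii) and iii).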
