Improving First Level Cache Efficiency for GPUs Using Dynamic Line Protection

A modern Graphics Processing Unit (GPU) utilizes L1 Data (L1D) caches to reduce memory bandwidth requirements and latencies. However, the L1D cache can easily be overwhelmed by many memory requests from GPU function units, which can bottleneck GPU performance. It has been shown that the performance of L1D caches is greatly reduced for many GPU applications as a large amount of L1D cache lines are replaced before they are re-referenced. By examining the cache access patterns of these applications, we observe L1D caches with low associativity have difficulty capturing data locality for GPU applications with diverse reuse patterns. These patterns result in frequent line replacements and low data re-usage. To improve the efficiency of L1D caches, we design a Dynamic Line Protection scheme (DLP) that can both preserve valuable cache lines and increase cache line utilization. DLP collects data reuse information from the L1D cache. This information is used to predict protection distances for each memory instruction at runtime, which helps maintain a balance between exploitation of data locality and over-protection of cache lines with long reuse distances. When all cache lines are protected in a set, redundant cache misses are bypassed to reduce the contention for the set. The evaluation result shows that our proposed solution improves cache hits while reducing cache traffic for cache-insufficient applications, achieving up to 137% and an average of 43% IPC improvement over the baseline.

[1]  P. Glaskowsky NVIDIA ’ s Fermi : The First Complete GPU Computing Architecture , 2009 .

[2]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  Samira Manabi Khan,et al.  Sampling Dead Block Prediction for Last-Level Caches , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[4]  Kurt Keutzer,et al.  A map reduce framework for programming graphics processors , 2010 .

[5]  Shuaiwen Song,et al.  Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.

[6]  Mateo Valero,et al.  Improving Cache Management Policies Using Dynamic Reuse Distances , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[7]  Kurt Keutzer,et al.  Fast support vector machine training and classification on graphics processors , 2008, ICML '08.

[8]  Mahmut T. Kandemir,et al.  Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[9]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[10]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).

[11]  Margaret Martonosi,et al.  Characterizing and improving the use of demand-fetched caches in GPUs , 2012, ICS '12.

[12]  Daniel A. Jiménez,et al.  Adaptive GPU cache bypassing , 2015, GPGPU@PPoPP.

[13]  Carole-Jean Wu,et al.  Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs , 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).

[14]  John Kim,et al.  Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[15]  Naga K. Govindaraju,et al.  Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  J. Ramanujam,et al.  Automatic C-to-CUDA Code Generation for Affine Programs , 2010, CC.

[17]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[18]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[19]  William J. Dally,et al.  The GPU Computing Era , 2010, IEEE Micro.

[20]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[21]  Xuhao Chen,et al.  Adaptive Cache Management for Energy-Efficient GPU Computing , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[22]  Yu Wang,et al.  Coordinated static and dynamic cache bypassing for GPUs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[23]  William J. Dally,et al.  Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).