Set variation-aware shared LLC management for CPU-GPU heterogeneous architecture

Heterogeneous CPU-GPU multiprocessor systems-on-chip (HMPSoC) becomes a popular architecture choice for high performance embedded systems, where shared last-level cache (LLC) management becomes a critical design consideration. We observe that within a sampling period, CPU and GPU may have distinct access behaviors over various LLC sets. In this work, we propose a light-weighted and fined-grained cache management policy to cope with the CPU-GPU access behavior variation among cache sets. In particular, CPU and GPU requests are prioritized disparately in each LLC set during cache block insertion and promotion, based on the per-core utility behaviors and a per-set CPU-GPU miss counter. Experimental results show that our LLC management scheme outperforms the two state-of-the-art schemes TAP-RRIP and LSP by 12.6% and 10.01%, respectively.

[1]  Xian-He Sun,et al.  DaCache: Memory Divergence-Aware GPU Cache Management , 2015, ICS.

[2]  Irving L. Traiger,et al.  Evaluation Techniques for Storage Hierarchies , 1970, IBM Syst. J..

[3]  Zhiping Jia,et al.  Shared last-level cache management for GPGPUs with hybrid main memory , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[4]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[5]  Yu Wang,et al.  Coordinated static and dynamic cache bypassing for GPUs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[6]  Antonia Zhai,et al.  Managing shared last-level cache in a heterogeneous multicore processor , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[7]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[8]  Chia-Lin Yang,et al.  Latency sensitivity-based cache partitioning for heterogeneous multi-core architecture , 2016, 2016 53nd ACM/EDAC/IEEE Design Automation Conference (DAC).

[9]  Zhihua Wang,et al.  Orchestrating Cache Management and Memory Scheduling for GPGPU Applications , 2014, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[10]  David A. Wood,et al.  gem5-gpu: A Heterogeneous CPU-GPU Simulator , 2015, IEEE Computer Architecture Letters.

[11]  Hyesoon Kim,et al.  TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[12]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[13]  Luca Benini,et al.  GPUguard: Towards supporting a predictable execution model for heterogeneous SoC , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[14]  Shuaiwen Song,et al.  Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.

[15]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[16]  Yun Liang,et al.  An efficient compiler framework for cache bypassing on GPUs , 2013, ICCAD 2013.

[17]  Keshav Pingali,et al.  Lonestar: A suite of parallel irregular programs , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[18]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Aamer Jaleel,et al.  High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[20]  Chita R. Das,et al.  OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[21]  Gabriel H. Loh,et al.  PIPP: promotion/insertion pseudo-partitioning of multi-core shared caches , 2009, ISCA '09.

[22]  Sunggu Lee,et al.  Hybrid DRAM/PRAM-based main memory for single-chip CPU/GPU , 2012, DAC Design Automation Conference 2012.

[23]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[24]  Dongrui Fan,et al.  Enabling coordinated register allocation and thread-level parallelism optimization for GPUs , 2018, MICRO.