Paper information - Set variation-aware shared LLC management for CPU-GPU heterogeneous architecture

Heterogeneous CPU-GPU multiprocessor systems-on-chip (HMPSoC) have become a popular architecture choice for high-performance embedded systems, where shared last-level cache (LLC) management is a critical design consideration. We observe that, within a sampling period, the CPU and GPU may exhibit distinct access behaviors across different LLC sets. In this work, we propose a lightweight, fine-grained cache management policy to cope with this variation in CPU-GPU access behavior among cache sets. In particular, CPU and GPU requests are prioritized differently in each LLC set during cache block insertion and promotion, based on per-core utility behavior and a per-set CPU-GPU miss counter. Experimental results show that our LLC management scheme outperforms two state-of-the-art schemes, TAP-RRIP and LSP, by 12.6% and 10.01%, respectively.
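To make the idea of per-set prioritization concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm. It assumes an RRIP-style set with 2-bit re-reference prediction values and a hypothetical saturating per-set counter that moves up on CPU misses and down on GPU misses; the requester the counter currently favours gets a better insertion position and a stronger promotion on hits. The constants, the class names `Block` and `CacheSet`, and the decision rule are all assumptions made for illustration, and the per-core utility monitoring mentioned in the abstract is not modelled.

```python
from dataclasses import dataclass, field

RRPV_MAX = 3       # 2-bit re-reference prediction value, as in RRIP
COUNTER_MAX = 15   # hypothetical 4-bit saturating counter per set
COUNTER_MID = 8    # hypothetical neutral point of the counter


@dataclass
class Block:
    tag: int
    is_gpu: bool   # which side inserted the block
    rrpv: int


@dataclass
class CacheSet:
    ways: int
    blocks: list = field(default_factory=list)
    miss_counter: int = COUNTER_MID  # up on CPU misses, down on GPU misses

    def favours(self, is_gpu: bool) -> bool:
        # Assumed rule: the side with more recent misses in this set is favoured.
        cpu_favoured = self.miss_counter >= COUNTER_MID
        return cpu_favoured != is_gpu

    def _record_miss(self, is_gpu: bool) -> None:
        step = -1 if is_gpu else 1
        self.miss_counter = min(COUNTER_MAX, max(0, self.miss_counter + step))

    def _victim(self) -> Block:
        # Standard RRIP victim search: age all blocks until one reaches RRPV_MAX.
        while True:
            for b in self.blocks:
                if b.rrpv == RRPV_MAX:
                    return b
            for b in self.blocks:
                b.rrpv += 1

    def access(self, tag: int, is_gpu: bool) -> bool:
        """Return True on a hit, False on a miss."""
        favoured = self.favours(is_gpu)
        for b in self.blocks:
            if b.tag == tag:
                # Promotion: favoured requests jump to RRPV 0, others move one step.
                b.rrpv = 0 if favoured else max(0, b.rrpv - 1)
                return True
        self._record_miss(is_gpu)
        if len(self.blocks) >= self.ways:
            self.blocks.remove(self._victim())
        # Insertion: favoured requests get a slightly longer expected lifetime.
        rrpv = RRPV_MAX - 1 if favoured else RRPV_MAX
        self.blocks.append(Block(tag, is_gpu, rrpv))
        return False
```

Because every set carries its own counter, two sets in the same cache can favour different requesters at the same time, which is the kind of set-level variation the abstract describes.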
