A Sample-Based Dynamic CPU and GPU LLC Bypassing Method for Heterogeneous CPU-GPU Architectures

Heterogeneous multicore processors with integrated CPU and GPU (Graphics Processing Unit) cores on the same chip pose new challenges for resource sharing, which is crucial for performance. Unlike traditional multicores, the CPU and GPU cores in an integrated architecture can generate significantly different volumes of cache traffic and exhibit quite different temporal and spatial data locality. The shared last-level cache (LLC) can therefore suffer heavy interference between CPU and GPU LLC accesses, degrading the performance of both the CPUs and the GPUs. Cache bypassing is a promising method to improve LLC performance and to alleviate resource contention between the CPU and the GPU. However, inefficient cache bypassing may lead to significant NoC (Network-on-Chip) traffic congestion and hence performance degradation, particularly for the CPU in a heterogeneous CPU-GPU system with an on-chip ring network. In this paper, we propose a sample-based dynamic cache bypassing method for the shared LLC in heterogeneous CPU-GPU multicore systems. The method samples the LLC miss rates and NoC traffic of both the CPU and the GPU at run time and uses a statistical decision-making model to decide intelligently whether or not to bypass. Our experiments show that, for the integrated CPU-GPU architecture with a ring-based NoC topology, bypassing CPU requests can be even more important than bypassing GPU requests. On average, bypassing both CPU and GPU improves CPU performance by 34.30% and GPU performance by 3.20%; bypassing the CPU alone improves CPU performance by 38.09% and GPU performance by 1.11%; and bypassing the GPU alone improves CPU performance by 4.12% and GPU performance by 2.60%.
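The abstract only outlines the decision mechanism, so the sketch below illustrates one way a sampling-based bypass controller might combine a per-interval LLC miss rate with NoC utilization to reach a bypass decision. This is a minimal sketch under stated assumptions: the class name, counters, and simple threshold rule are illustrative and do not reproduce the paper's actual statistical decision-making model.

```cpp
// Illustrative sketch of a sample-based LLC bypass controller.
// All names and thresholds are assumptions for illustration only.
#include <cstdint>

struct SampleCounters {
    uint64_t accesses = 0;   // LLC accesses observed in the sampling window
    uint64_t misses   = 0;   // LLC misses observed in the sampling window
    uint64_t nocFlits = 0;   // NoC flits injected in the sampling window
};

class BypassController {
public:
    BypassController(double missThreshold, double nocThreshold)
        : missThreshold_(missThreshold), nocThreshold_(nocThreshold) {}

    // Called once per sampling interval with counters for one core type
    // (CPU or GPU). Returns true if that core type's requests should
    // bypass the LLC during the next interval.
    bool decide(const SampleCounters& s, uint64_t nocCapacityFlits) const {
        if (s.accesses == 0 || nocCapacityFlits == 0) return false;
        double missRate = static_cast<double>(s.misses) / s.accesses;
        double nocLoad  = static_cast<double>(s.nocFlits) / nocCapacityFlits;
        // Bypass only when the LLC is not helping (high miss rate) and the
        // ring NoC still has headroom to absorb the extra memory traffic.
        return missRate > missThreshold_ && nocLoad < nocThreshold_;
    }

private:
    double missThreshold_;  // miss rate above which caching is judged unhelpful
    double nocThreshold_;   // NoC utilization above which bypassing is avoided
};
```

In a full design along the lines the abstract describes, the CPU and GPU would be tracked with separate counters (and possibly separate thresholds), since the two core types generate very different traffic volumes and the paper's results show that CPU-side and GPU-side bypassing have distinct performance effects.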
