Analyzing Memory Access on CPU-GPGPU Shared LLC Architecture

Data exchange between GPGPUs and CPUs is becoming increasingly important. One industry trend for alleviating the long transfer latency is to integrate CPUs and GPGPUs on a single chip. In this paper, we analyze the memory reference interactions between CPU and GPGPU applications using a CPU-GPGPU co-simulator that integrates gem5 and GPGPU-Sim. Since the memory controllers are shared among all cores, we observe severe memory contention between them: CPU applications suffer a 1.26x slowdown and spend 64.79% of their main-memory time blocked when they run in parallel with GPGPU applications. To alleviate this contention and provide more memory bandwidth, shared last-level caches (LLCs) are commonly employed in such systems. We evaluate a banked shared LLC structure implemented in the co-simulator and show that a simple shared LLC mostly benefits the GPGPU (a 2.13x speedup over running alone and 1.7x over running in parallel) rather than the CPU. With the LLC, the number of requests issued to main memory is reduced to 30.74% and the blocked time to 49.64%, which frees up memory bandwidth. Latency-sensitive CPU applications still suffer, however, because LLC buffer occupancy is very high when they run in parallel with the GPGPU. Moreover, as the number of LLC banks grows, the CPU achieves a higher speedup than the GPGPU from the increased LLC parallelism. Finally, we discuss the impact of the GPGPU L2 cache and find that fewer GPGPU L2 cache banks lower performance because they limit the GPGPU's parallelism. The observations and inferences in this paper may serve as a reference guide for future CPU-GPGPU shared LLC designs.
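To make the banked-LLC indexing concrete, the minimal C++ sketch below shows one common way a shared LLC could select a bank from the physical address, so that consecutive cache lines interleave across banks and streaming (GPGPU-like) traffic spreads over all of them. The 64-byte line size, eight-bank count, and line-interleaved indexing are illustrative assumptions for this sketch, not details taken from the paper's design.

```cpp
#include <cstdint>
#include <cstdio>

// Assumed parameters for illustration (not the paper's configuration).
constexpr uint64_t kLineBytes = 64;  // assumed cache-line size
constexpr uint64_t kNumBanks  = 8;   // assumed bank count (power of two)

// Line-interleaved bank selection: drop the block offset, then take the
// low-order bits of the line address as the bank id. Consecutive cache
// lines therefore map to consecutive banks.
unsigned llc_bank(uint64_t paddr) {
    return static_cast<unsigned>((paddr / kLineBytes) % kNumBanks);
}

int main() {
    // A streaming access pattern touches every bank once per 8 lines,
    // which is the bank-level parallelism more LLC banks would expose.
    for (uint64_t addr = 0; addr < 8 * kLineBytes; addr += kLineBytes)
        printf("paddr 0x%04llx -> bank %u\n",
               static_cast<unsigned long long>(addr), llc_bank(addr));
    return 0;
}
```

Under this kind of interleaving, adding banks lets more outstanding requests be serviced concurrently, which is consistent with the speedups the paper attributes to increased LLC parallelism.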
