LLC Buffer for Arbitrary Data Sharing in Heterogeneous Systems

A closely coupled CPU-GPGPU system with a shared last-level cache (LLC) enables fine-grained data exchange. However, the traditional cache-based exchange causes unnecessary LLC misses and degrades overall system performance. We argue that the cache organization is ill-suited to temporary data exchange in such closely coupled systems. We analyze the memory access pattern and identify the inefficiency of the data exchange: when the exchanged data cannot fit in the LLC, the low LLC hit rate exacerbates core stalls and memory contention. We also show that these stalls cannot be entirely hidden by increasing the compute load or parallelism. Previous work introduced a simple LLC buffer that replaces the cache with an architecture-supported data queue; however, that design limits the data element size and requires potentially very large storage for pending requests. In this paper, we propose an improved LLC buffer. It adopts an element-atom data organization to enable data exchange of arbitrary size, together with a simple hardware-software collaborative protocol that eliminates the pending requests. Experimental results show an average speedup of 48.2% over the traditional cache-based exchange, but a 7.5% slowdown relative to the simple LLC buffer due to the protocol overhead. We also compare the improved LLC buffer with the fine-grain task approach, which implements a data exchange channel between the CPU and GPGPU; the results show that the improved LLC buffer has lower storage overhead and higher access efficiency than the fine-grain task.
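The element-atom idea can be illustrated with a minimal software sketch. This is a hypothetical model, not the paper's hardware design: it assumes 64-byte atoms (a typical cache-line size) and a 4-byte length header in the first atom, neither of which is specified in the abstract. An element of arbitrary size is split into fixed-size atoms on enqueue and reassembled on dequeue:

```python
# Hypothetical sketch of an element-atom queue: elements of arbitrary
# size are split into fixed-size atoms before enqueue and reassembled
# on dequeue. Atom size and header layout are assumptions for
# illustration, not details taken from the paper.
from collections import deque

ATOM_SIZE = 64  # bytes; assumed cache-line-sized atoms


def enqueue_element(queue: deque, payload: bytes) -> None:
    """Split an arbitrary-size element into atoms and push them as a unit.

    The first atom begins with a 4-byte little-endian header holding the
    element's total length, so the consumer knows how many atoms to pop.
    """
    header = len(payload).to_bytes(4, "little")
    data = header + payload
    for i in range(0, len(data), ATOM_SIZE):
        # Pad the last atom to the fixed atom size.
        queue.append(data[i:i + ATOM_SIZE].ljust(ATOM_SIZE, b"\0"))


def dequeue_element(queue: deque) -> bytes:
    """Pop atoms until the whole element is reassembled, then strip padding."""
    first = queue.popleft()
    length = int.from_bytes(first[:4], "little")
    data = first[4:]
    while len(data) < length:
        data += queue.popleft()
    return data[:length]


q = deque()
enqueue_element(q, b"x" * 100)          # element spanning multiple atoms
assert dequeue_element(q) == b"x" * 100
```

In the paper's setting the queue storage would live in the LLC and the split/reassembly would be handled by the architecture-supported buffer rather than software, but the packing logic is the same in spirit.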
