Enable back memory and global synchronization on LLC buffer

The last-level cache (LLC) shared by heterogeneous processors such as the CPU and general-purpose graphics processing unit (GPGPU) brings new opportunities to optimize data sharing among them. Previous work introduced the LLC buffer, which uses part of the LLC storage as a FIFO buffer to enable data sharing between the CPU and GPGPU with negligible management overhead. However, the baseline LLC buffer's capacity is limited, and it can deadlock when the buffer is full. It also relies on inefficient CPU kernel relaunch and high-overhead atomic operations on the GPGPU for global synchronization. These limitations motivate us to enable back memory and global synchronization on the baseline LLC buffer and make it more practical. The back memory divides the buffer storage into two levels. Although the two levels are managed as a single queue, the data storage in each level is managed as an individual circular buffer. Data are redirected to the memory level when the LLC level is full and are loaded back into the LLC level when it has free space. A case study of n-queens shows that the back memory performs comparably to an LLC buffer with an infinitely large LLC level, whereas the LLC buffer without back memory exhibits a 10% performance degradation caused by buffer-space contention. Global synchronization is enabled by peeking at the data about to be read from the buffer: any request to read data placed in the LLC buffer after a global barrier is allowed only when all threads have reached the barrier. We adopt breadth-first search (BFS) as a case study and compare the LLC buffer with an optimized GPGPU implementation of BFS. The results show that the LLC buffer achieves a speedup of 1.70 on average. The global synchronization time on the GPGPU and CPU is decreased to 38 and 60–5%, respectively.
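To make the two-level queue behaviour concrete, the following is a minimal host-side C++ sketch of the mechanism the abstract describes. It is not the paper's hardware design: the class name TwoLevelFifo, its methods, and the use of std::deque in place of real circular buffers in LLC and DRAM are illustrative assumptions; only the overflow/refill policy and the peek operation follow the abstract.

```cpp
#include <cstddef>
#include <deque>
#include <optional>

// Illustrative model of the two-level LLC buffer: a bounded "LLC level"
// backed by an (effectively unbounded) "memory level". The two levels are
// exposed as a single FIFO queue.
template <typename T>
class TwoLevelFifo {
public:
    explicit TwoLevelFifo(std::size_t llc_capacity)
        : llc_capacity_(llc_capacity) {}

    // Producer side: new data stays in the LLC level while it has room and
    // nothing has spilled; otherwise it is redirected to the memory level,
    // so the producer never blocks on a full buffer.
    void push(const T& value) {
        if (llc_.size() < llc_capacity_ && memory_.empty()) {
            llc_.push_back(value);     // fast path: data stays in the LLC
        } else {
            memory_.push_back(value);  // overflow: redirect to back memory
        }
    }

    // Consumer side: data is always served from the LLC level; freeing a
    // slot triggers a refill from the memory level so FIFO order is kept.
    std::optional<T> pop() {
        if (llc_.empty()) return std::nullopt;
        T value = llc_.front();
        llc_.pop_front();
        refill();
        return value;
    }

    // Peek at the element that would be read next without consuming it.
    // The global synchronization scheme gates reads of data placed after a
    // barrier on all threads having reached that barrier.
    std::optional<T> peek() const {
        if (llc_.empty()) return std::nullopt;
        return llc_.front();
    }

private:
    // Load data back into the LLC level whenever it has free space.
    void refill() {
        while (llc_.size() < llc_capacity_ && !memory_.empty()) {
            llc_.push_back(memory_.front());
            memory_.pop_front();
        }
    }

    std::size_t llc_capacity_;
    std::deque<T> llc_;     // level 1: slice of LLC storage (bounded)
    std::deque<T> memory_;  // level 2: back memory
};
```

The check that the memory level is empty before appending to the LLC level is what keeps the two levels behaving as one queue: once any element has spilled to back memory, newer elements must queue behind it rather than jump ahead in the LLC level.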
