Evaluating the effect of last-level cache sharing on integrated GPU-CPU systems with heterogeneous applications

Heterogeneous systems are ubiquitous in the field of High- Performance Computing (HPC). Graphics processing units (GPUs) are widely used as accelerators for their enormous computing potential and energy efficiency; furthermore, on-die integration of GPUs and general-purpose cores (CPUs) enables unified virtual address spaces and seamless sharing of data structures, improving programmability and softening the entry barrier for heterogeneous programming. Although on-die GPU integration seems to be the trend among the major microprocessor manufacturers, there are still many open questions regarding the architectural design of these systems. This paper is a step forward towards understanding the effect of on-chip resource sharing between GPU and CPU cores, and in particular, of the impact of last-level cache (LLC) sharing in heterogeneous computations. To this end, we analyze the behavior of a variety of heterogeneous GPU-CPU benchmarks on different cache configurations. We perform an evaluation of the popular Rodinia benchmark suite modified to leverage the unified memory address space. We find such GPGPU workloads to be mostly insensitive to changes in the cache hierarchy due to the limited interaction and data sharing between GPU and CPU. We then evaluate a set of heterogeneous benchmarks specifically designed to take advantage of the finegrained data sharing and low-overhead synchronization between GPU and CPU cores that these integrated architectures enable. We show how these algorithms are more sensitive to the design of the cache hierarchy, and find that when GPU and CPU share the LLC execution times are reduced by 25% on average, and energy-to-solution by over 20% for all benchmarks.

[1]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[2]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[3]  Antonia Zhai,et al.  Managing shared last-level cache in a heterogeneous multicore processor , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[4]  José Ignacio Benavides Benítez,et al.  An optimized approach to histogram computation on GPU , 2012, Machine Vision and Applications.

[5]  Sudhakar Yalamanchili,et al.  Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures , 2013, ACM Trans. Design Autom. Electr. Syst..

[6]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[7]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[8]  Mayank Daga,et al.  Exploiting Coarse-Grained Parallelism in B+ Tree Searches on an APU , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.

[9]  John Goodacre,et al.  Parallelism and the ARM instruction set architecture , 2005, Computer.

[10]  Tarek S. Abdelrahman,et al.  Parallel Radix Sort on the AMD Fusion Accelerated Processing Unit , 2013, 2013 42nd International Conference on Parallel Processing.

[11]  David A. Wood,et al.  GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors , 2015, 2015 IEEE International Symposium on Workload Characterization.

[12]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[13]  David A. Wood,et al.  gem5-gpu: A Heterogeneous CPU-GPU Simulator , 2015, IEEE Computer Architecture Letters.

[14]  Wen-mei W. Hwu,et al.  Heterogeneous System Architecture: A New Compute Platform Infrastructure , 2015 .

[15]  Walter Stechele,et al.  Egomotion compensation and moving objects detection algorithm on GPU , 2011, PARCO.

[16]  Wu-chun Feng,et al.  On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.

[17]  Bingsheng He,et al.  Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture , 2013, Proc. VLDB Endow..

[18]  Hyesoon Kim,et al.  TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[19]  Mitesh R. Meswani,et al.  Efficient breadth-first search on a heterogeneous processor , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[20]  Mattan Erez,et al.  A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC , 2012, DAC Design Automation Conference 2012.

[21]  Sarita V. Adve,et al.  Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[22]  Brian Vinter,et al.  Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors , 2015, Parallel Comput..

[23]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[24]  N. Muralimanohar,et al.  CACTI 6 . 0 : A Tool to Understand Large Caches , 2007 .

[25]  Daniel J. Sorin,et al.  Exploring memory consistency for massively-threaded throughput-oriented processors , 2013, ISCA.

[26]  Aamer Jaleel,et al.  Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[27]  Kevin Kai-Wei Chang,et al.  Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[28]  David R. Kaeli,et al.  Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems , 2013, GPGPU@ASPLOS.

[29]  Mahmut T. Kandemir,et al.  Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[30]  Peter Sewell,et al.  A Better x86 Memory Model: x86-TSO , 2009, TPHOLs.

[31]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[32]  Juan Gómez-Luna,et al.  In-Place Data Sliding Algorithms for Many-Core Architectures , 2015, 2015 44th International Conference on Parallel Processing.