Spandex: A Flexible Interface for Efficient Heterogeneous Coherence

Recent heterogeneous architectures have trended toward tighter integration and shared memory, largely because of the efficient communication and programmability this shift enables. However, such integration is complex, because accelerators use widely disparate methods for accessing data and keeping it coherent. Some processors use caches backed by hardware coherence protocols like MESI, while others prefer lightweight software coherence protocols or use specialized memories like scratchpads with differing state and communication granularities. Modern solutions tend to build interfaces that extend existing MESI-style CPU coherence protocols, often by adding hierarchical indirection through intermediate shared caches. Although functionally correct, these strategies lack flexibility and generally suffer from performance limitations that make them sub-optimal for some emerging accelerators and workloads. Instead, we need a flexible interface that can efficiently integrate existing and future devices without requiring intrusive changes to their memory structure. We introduce Spandex, an improved coherence interface based on the simple and scalable DeNovo coherence protocol. Spandex (which takes its name from the flexible material commonly used in one-size-fits-all textiles) directly interfaces devices with diverse coherence properties and memory demands, enabling each device to communicate in a manner appropriate for its specific access properties. We demonstrate the importance of this flexibility by comparing this strategy against a more conventional MESI-based hierarchical solution for a diverse range of heterogeneous applications. On average for the applications studied, Spandex reduces execution time by 16% (max 29%) and network traffic by 27% (max 58%) relative to the MESI-based hierarchical solution.
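The core idea, that each device issues the coherence request type best matched to its own access properties rather than being forced through a single MESI-style state machine, can be illustrated with a minimal sketch. This is not the paper's implementation; the request names loosely follow DeNovo/Spandex-style terminology, and the `choose_request` policy and its `reuse` parameter are hypothetical simplifications for illustration.

```python
from enum import Enum

class Req(Enum):
    """Coherence request types a flexible interface might expose.

    Names loosely follow DeNovo/Spandex-style terminology; the exact
    semantics here are an illustrative simplification.
    """
    REQ_V = "ReqV"    # read a valid copy; reader self-invalidates later (no sharer tracking)
    REQ_S = "ReqS"    # read a shared copy tracked at the shared cache, invalidated on a write
    REQ_WT = "ReqWT"  # write through to the shared level without obtaining ownership
    REQ_O = "ReqO"    # obtain ownership for local writeback caching

def choose_request(is_write: bool, reuse: str) -> Req:
    """Pick a request type from a device's access properties (hypothetical policy).

    reuse: "high" for latency-sensitive, cache-friendly access (CPU-like);
           "low" for streaming, throughput-oriented access (GPU-like).
    """
    if is_write:
        # A cache-friendly core benefits from ownership; a streaming
        # accelerator avoids ownership overhead by writing through.
        return Req.REQ_O if reuse == "high" else Req.REQ_WT
    # High-reuse readers want tracked shared copies; streaming readers
    # take untracked valid copies and self-invalidate at synchronization.
    return Req.REQ_S if reuse == "high" else Req.REQ_V
```

In this sketch a MESI-like CPU core maps its writes to ownership requests, while a GPU-like device streams writes through and reads untracked valid copies, so each device communicates at a cost matched to its access pattern instead of paying for invalidation tracking it does not need.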
