Software Assisted Hardware Cache Coherence for Heterogeneous Processors

Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically depends upon the ability to efficiently support cache coherence and shared virtual memory across tightly-integrated CPUs and GPUs. However, throughput-oriented GPUs easily overwhelm existing hardware coherence mechanisms that long kept the cache hierarchies in multi-core CPUs coherent. This paper proposes a novel solution called Software Assisted Hardware Coherence (SAHC) to scale cache coherence to future heterogeneous processors. We observe that the system software (Operating system and runtime) often has semantic knowledge about sharing patterns of data across the CPU and the GPU. This high-level knowledge can be utilized to effectively provide cache coherence across throughput-oriented GPUs and latency-sensitive CPUs in a heterogeneous processor. SAHC thus proposes a hybrid software-hardware mechanism that judiciously uses hardware coherence only when needed while using software's knowledge to filter out most of the unnecessary coherence traffic. Our evaluation suggests that SAHC can often eliminate up to 98-100% of the hardware coherence lookups, resulting up to 49% reduction in runtime.

[1]  Andreas Moshovos,et al.  A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2]  David A. Wood,et al.  Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3]  Mark Silberstein,et al.  PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[4]  Babak Falsafi,et al.  Multi-grain coherence directories , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[5]  Jaehyuk Huh,et al.  Subspace snooping: Filtering snoops with operating system support , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[6]  Mike O'Connor,et al.  Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[7]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[8]  Niraj K. Jha,et al.  In-Network Coherence Filtering: Snoopy coherence without broadcasts , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  Babak Falsafi,et al.  JETTY: filtering snoops for reduced energy consumption in SMP servers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[10]  Babak Falsafi,et al.  TurboTag: Lookup filtering to reduce coherence directory power , 2010, 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED).

[11]  William J. Dally,et al.  Stream register files with indexed access , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[12]  Sanjay J. Patel,et al.  Cohesion: a hybrid memory model for accelerators , 2010, ISCA.

[13]  Andreas Moshovos RegionScout: exploiting coarse grain sharing in snoop-based coherence , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[14]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[15]  John E. Stone,et al.  An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.

[16]  Natalie D. Enright Jerger,et al.  Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[17]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[18]  Mohammad Alisafaee Spatiotemporal Coherence Tracking , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.