The locality-aware adaptive cache coherence protocol

Next generation multicore applications will process massive amounts of data with significant sharing. Data movement and management impacts memory access latency and consumes power. Therefore, harnessing data locality is of fundamental importance in future processors. We propose a scalable, efficient shared memory cache coherence protocol that enables seamless adaptation between private and logically shared caching of on-chip data at the fine granularity of cache lines. Our data-centric approach relies on in-hardware yet low-overhead runtime profiling of the locality of each cache line and only allows private caching for data blocks with high spatio-temporal locality. This allows us to better exploit the private caches and enable low-latency, low-energy memory access, while retaining the convenience of shared memory. On a set of parallel benchmarks, our low-overhead locality-aware mechanisms reduce the overall energy by 25% and completion time by 15% in an NoC-based multicore with the Reactive-NUCA on-chip cache organization and the ACKwise limited directory-based coherence protocol.

[1]  Gary S. Tyson,et al.  A modified approach to data cache management , 1995, MICRO 1995.

[2]  Aamer Jaleel,et al.  Achieving Non-Inclusive Cache Performance with Inclusive Caches: Temporal Locality Aware (TLA) Cache Management Policies , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[4]  Babak Falsafi,et al.  Reactive NUCA: near-optimal block placement and replication in distributed caches , 2009, ISCA '09.

[5]  Henry Hoffmann,et al.  Remote Store Programming , 2010, HiPEAC.

[6]  Håkan Grahn,et al.  ParMiBench - An Open-Source Benchmark for Embedded Multiprocessor Systems , 2010, IEEE Computer Architecture Letters.

[7]  Chen Sun,et al.  DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.

[8]  George Kurian,et al.  ATAC: A 1000-core cache-coherent processor with on-chip optical network , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[9]  David Wentzlaff,et al.  Processor: A 64-Core SoC with Mesh Interconnect , 2010 .

[10]  Mark Anders,et al.  Near-threshold voltage (NTV) design — Opportunities and challenges , 2012, DAC Design Automation Conference 2012.

[11]  Nancy Warter-Perez,et al.  Modulo scheduling with multiple initiation intervals , 1995, MICRO 1995.

[12]  Vijayalakshmi Srinivasan,et al.  A Tagless Coherence Directory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[13]  Natalie D. Enright Jerger,et al.  Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support , 2008, 2008 International Symposium on Computer Architecture.

[14]  George Kurian,et al.  Graphite: A distributed parallel simulator for multicores , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[15]  Vijayalakshmi Srinivasan,et al.  SPATL: Honey, I Shrunk the Coherence Directory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[16]  Wen-mei W. Hwu,et al.  Run-time Adaptive Cache Hierarchy Via Reference Analysis , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[17]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[18]  Yale N. Patt,et al.  Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[19]  Krste Asanovic,et al.  Victim replication: maximizing capacity while hiding wire delay in tiled chip multiprocessors , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[20]  Jaehyuk Huh,et al.  Cache bursts: A new approach for eliminating dead blocks and increasing cache efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[21]  Christoforos E. Kozyrakis,et al.  SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[22]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[23]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[24]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[25]  Kevin M. Lepak,et al.  Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.