TSO-CC: Consistency directed cache coherence for TSO

Traditional directory coherence protocols are designed for the strictest consistency model, sequential consistency (SC). When they are used for chip multiprocessors (CMPs) that support relaxed memory consistency models, such protocols turn out to be unnecessarily strict. Usually this comes at the cost of scalability (in terms of per core storage), which poses a problem with increasing number of cores in today's CMPs, most of which no longer are sequentially consistent. Because of the wide adoption of Total Store Order (TSO) and its variants in x86 and SPARC processors, and existing parallel programs written for these architectures, we propose TSO-CC, a cache coherence protocol for the TSO memory consistency model. TSO-CC does not track sharers, and instead relies on self-invalidation and detection of potential acquires using timestamps to satisfy the TSO memory consistency model lazily. Our results show that TSO-CC achieves average performance comparable to a MESI directory protocol, while TSO-CC's storage overhead per cache line scales logarithmically with increasing core count.

[1]  Antonio Robles,et al.  Increasing the effectiveness of directory caches by deactivating coherence for private memory blocks , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[2]  Sarita V. Adve,et al.  DeNovoND: efficient hardware support for disciplined non-determinism , 2013, ASPLOS '13.

[3]  Marcelo Cintra,et al.  An OS-based alternative to full hardware coherence on tiled CMPs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[4]  Michael F. Spear,et al.  NOrec: streamlining STM by abolishing ownership records , 2010, PPoPP '10.

[5]  BurgerDoug,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002 .

[6]  Sarita V. Adve,et al.  Shared Memory Consistency Models: A Tutorial , 1996, Computer.

[7]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[8]  Mark Horowitz,et al.  An evaluation of directory schemes for cache coherence , 1998, ISCA '98.

[9]  Ricardo Bianchini,et al.  Lazy Release Consistency for Hardware-Coherent Multiprocessors , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[10]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[11]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[12]  Christoforos E. Kozyrakis,et al.  SCD: A scalable coherence directory with flexible sharer set encoding , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[13]  Alan L. Cox,et al.  TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems , 1994, USENIX Winter.

[14]  Mustaque Ahamad,et al.  Implementing and programming causal distributed shared memory , 1991, [1991] Proceedings. 11th International Conference on Distributed Computing Systems.

[15]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[16]  Mike O'Connor,et al.  Cache coherence for GPU architectures , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[17]  Robert H. B. Netzer Optimal tracing and replay for debugging shared-memory parallel programs , 1993, PADD '93.

[18]  Thomas J. Ashby,et al.  Software-Based Cache Coherence with Hardware-Assisted Selective Self-Invalidations Using Bloom Filters , 2011, IEEE Transactions on Computers.

[19]  David A. Wood,et al.  Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[20]  Michel Dubois,et al.  Correct memory operation of cache-based multiprocessors , 1987, ISCA '87.

[21]  Deborah A. Wallach PHD: A Hierarchical Cache Coherent Protocol , 1992 .

[22]  Tianshi Chen,et al.  DLS: Directoryless Shared Last-level Cache , 2012, ArXiv.

[23]  Anoop Gupta,et al.  Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes , 1990, ICPP.

[24]  Francesco Zappa Nardelli,et al.  x86-TSO , 2010, Commun. ACM.

[25]  Gil Neiger,et al.  Causal memory: definitions, implementation, and programming , 1995, Distributed Computing.

[26]  David A. Wood,et al.  A Primer on Memory Consistency and Cache Coherence , 2012, Synthesis Lectures on Computer Architecture.

[27]  Seth H. Pugsley,et al.  SWEL: Hardware cache coherence protocols to map shared data onto shared caches , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[28]  Kunle Olukotun,et al.  STAMP: Stanford Transactional Applications for Multi-Processing , 2008, 2008 IEEE International Symposium on Workload Characterization.

[29]  Babak Falsafi,et al.  Cuckoo directory: A scalable directory for many-core systems , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[30]  Alan L. Cox,et al.  Lazy release consistency for software distributed shared memory , 1992, ISCA '92.

[31]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[32]  Zhiqiang Ma,et al.  Ad Hoc Synchronization Considered Harmful , 2010, OSDI.

[33]  Kevin Skadron,et al.  Design issues and tradeoffs for write buffers , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[34]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[35]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[36]  Rajiv Gupta,et al.  Dynamic recognition of synchronization operations for improved data race detection , 2008, ISSTA '08.

[37]  Stefanos Kaxiras,et al.  Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[38]  Sarita V. Adve,et al.  DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[39]  Willy Zwaenepoel,et al.  Implementation and performance of Munin , 1991, SOSP '91.

[40]  Sang Lyul Min,et al.  Design and Analysis of a Scalable Cache Coherence Scheme Based on Clocks and Timestamps , 1992, IEEE Trans. Parallel Distributed Syst..

[41]  Rami G. Melhem,et al.  A timestamp-based selective invalidation scheme for multiprocessor cache coherence , 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[42]  Sandhya Dwarkadas,et al.  SPACE: Sharing pattern-based directory coherence for multicore scalability , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).

[43]  Michel Dubois,et al.  Delayed consistency and its effects on the miss rate of parallel programs , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[44]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[45]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[46]  Michel Dubois,et al.  Memory access buffering in multiprocessors , 1998, ISCA '98.

[47]  Jade Alglave,et al.  Litmus: Running Tests against Hardware , 2011, TACAS.

[48]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[49]  Stefanos Kaxiras,et al.  SARC Coherence: Scaling Directory Cache Coherence in Performance and Power , 2010, IEEE Micro.