Edge chasing delayed consistency: pushing the limits of weak memory models

In shared-memory multiprocessors that use invalidation-based coherence protocols, cache misses caused by inter-processor communication are a dominant source of processor stall cycles for many applications. We explore a novel coherence protocol implementation called edge-chasing delayed consistency (ECDC) that mitigates some of the performance degradation caused by this class of misses. Edge-chasing delayed consistency allows a processor to non-speculatively continue reading a cache line after receiving an invalidation from another processor, without changing the consistency model offered to programmers. While the idea of using stale data for as long as possible is enticing, our study shows that the benefits of such delay are small, and that most of the benefit of delaying invalidations comes from mitigating the false sharing problem, rather than from any tolerance of races or an application's ability to consume stale data productively.
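
The distinction between true and false sharing that the abstract relies on can be illustrated with a toy miss classifier. This is only a sketch under simplifying assumptions: a single cache line, a bare write-invalidate protocol, and a simplified definition in which a miss counts as false sharing when the word being accessed was not written remotely since the local copy was invalidated. The function name `classify_misses` and the trace format are hypothetical, and the code models neither ECDC nor the paper's methodology.

```python
def classify_misses(trace):
    """Classify misses on one cache line under a toy write-invalidate protocol.

    trace: iterable of (core, op, word) tuples, where op is "R" or "W".
    Returns a dict counting cold, true-sharing and false-sharing misses.
    Simplified definition (an assumption of this sketch): a coherence miss is
    "true sharing" only if the accessed word was written by another core
    since this core's copy was invalidated.
    """
    valid = {}    # core -> True while the core holds a usable copy
    changed = {}  # core -> words written remotely since its copy was invalidated
    counts = {"cold": 0, "true_sharing": 0, "false_sharing": 0}

    for core, op, word in trace:
        if core not in valid:
            counts["cold"] += 1
        elif not valid[core]:
            kind = "true_sharing" if word in changed[core] else "false_sharing"
            counts[kind] += 1
        # After a hit or a refill the core holds an up-to-date copy.
        valid[core] = True
        changed[core] = set()

        if op == "W":
            # A write invalidates every other cached copy of the line.
            for other in valid:
                if other != core:
                    valid[other] = False
                    changed[other].add(word)
    return counts


# Core 0 keeps writing word 0 while core 1 only ever reads word 1.
false_sharing_trace = [(0, "W", 0), (1, "R", 1)] * 3
# Core 1 reads the very word that core 0 writes.
true_sharing_trace = [(0, "W", 0), (1, "R", 0)] * 2

print(classify_misses(false_sharing_trace))
print(classify_misses(true_sharing_trace))
```

In the first trace every invalidation-induced miss on core 1 is false sharing: the data it reads never changed, so delaying the invalidation would have been harmless. In the second trace the miss is needed to observe the new value.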
