Paper information - Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

The paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache block before a conflicting access by another processor. Eliminating invalidation overhead is particularly important under sequential consistency, where the latency of invalidating outstanding copies can increase a program's critical path. DSI is applicable to software, hardware, and hybrid coherence schemes. We evaluate DSI in the context of hardware directory-based write-invalidate coherence protocols. Our results show that DSI reduces execution time of a sequentially consistent full-map coherence protocol by as much as 41%. This is comparable to an implementation of weak consistency that uses a coalescing write-buffer to allow up to 16 outstanding requests for exclusive blocks. When used in conjunction with weak consistency, DSI can exploit tear-off blocks, which eliminate both invalidation and acknowledgment messages, for a total reduction in messages of up to 26%.
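To make the message-saving idea concrete, the following is a minimal toy sketch, not the protocol evaluated in the paper. It models a single cache block under a write-invalidate directory and counts only invalidation and acknowledgment messages. The class `Block`, the `sync` method (standing in for some point at which a reader drops its own copy), and the producer/consumer access pattern in `run` are all hypothetical names and assumptions introduced for illustration. The absolute counts it produces have no relation to the 41% or 26% figures above.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy model of one cache block (hypothetical, for illustration only)."""
    tear_off: bool = False                      # True: read copies are untracked
    tracked: set = field(default_factory=set)   # sharers the directory knows about
    cached: set = field(default_factory=set)    # CPUs currently holding a valid copy
    messages: int = 0                           # invalidations + acknowledgments only

    def read(self, cpu: int) -> None:
        self.cached.add(cpu)
        # A tear-off copy is handed out without recording the sharer,
        # on the assumption that the reader will invalidate it on its own.
        if not self.tear_off:
            self.tracked.add(cpu)

    def sync(self, cpu: int) -> None:
        # Assumed self-invalidation point: the reader drops its tear-off
        # copy locally, which costs no coherence message in this model.
        if self.tear_off:
            self.cached.discard(cpu)

    def write(self, cpu: int) -> None:
        # Write-invalidate: each other tracked sharer receives one
        # invalidation and replies with one acknowledgment.
        victims = self.tracked - {cpu}
        self.messages += 2 * len(victims)
        self.cached -= victims
        self.tracked = {cpu}
        self.cached.add(cpu)


def run(tear_off: bool, readers: int = 3, rounds: int = 4) -> int:
    """Consumers read, reach a sync point, then CPU 0 writes; repeat."""
    block = Block(tear_off=tear_off)
    for _ in range(rounds):
        for cpu in range(1, readers + 1):
            block.read(cpu)
        for cpu in range(1, readers + 1):
            block.sync(cpu)
        block.write(0)
        # No stale copy may survive a write in either mode.
        assert block.cached == {0}
    return block.messages


if __name__ == "__main__":
    print("tracked sharers:", run(tear_off=False))
    print("tear-off copies:", run(tear_off=True))
```

In this sketch the tracked variant pays two messages per reader on every write, while the tear-off variant pays none because the directory never has to find the readers. The model deliberately ignores data request and reply traffic, and it assumes every reader reaches its self-invalidation point before the next write.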

David A. Wood | Alvin R. Lebeck
