Implementation and evaluation of update-based cache protocols under relaxed memory consistency models

Abstract

Invalidation-based cache coherence protocols have been extensively studied in the context of large-scale shared-memory multiprocessors. Under a relaxed memory consistency model, most of the write latency can be hidden, whereas cache misses still incur a severe performance penalty. By contrast, update-based protocols have the potential to reduce both write and read penalties under relaxed memory consistency models because coherence misses can be completely eliminated. The purpose of this paper is to compare update- and invalidation-based protocols for their ability to reduce or hide memory access latencies and for their ease of implementation under relaxed memory consistency models. Based on a detailed simulation study, we find that write-update protocols augmented with simple competitive mechanisms (we call such protocols competitive-update protocols) can hide all the write latency and cut the read penalty by as much as 46%, at the cost of some increase in memory traffic. However, compared with write-invalidate protocols, update-based protocols require more aggressive memory consistency models and more local buffering in the second-level cache to be effective. In addition, their larger number of global writes may increase synchronization overhead in applications with high contention for critical sections.
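
The abstract does not spell out the competitive mechanism, so the following is only a minimal sketch of one common form such a mechanism can take, not the protocol evaluated in the paper. It assumes a per-line counter that is reset on every local access and decremented on every update received from another processor; when the counter reaches zero, the local copy invalidates itself so that later updates no longer have to be sent to it. The class names, the `threshold` parameter, and its default value are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    """One cached copy of a block, with a hypothetical competitive counter."""
    value: int = 0
    valid: bool = True
    remaining: int = 0  # remote updates left before the copy self-invalidates


class CompetitiveUpdateCache:
    """Toy model of a single cache under an assumed competitive-update policy."""

    def __init__(self, threshold: int = 4):
        # threshold is an illustrative assumption, not a value from the paper
        self.threshold = threshold
        self.lines: dict[int, CacheLine] = {}

    def local_access(self, addr: int, memory: dict[int, int]) -> tuple[int, bool]:
        """Read addr locally; return (value, was_miss) and reset the counter."""
        line = self.lines.get(addr)
        miss = line is None or not line.valid
        if miss:
            # Fetch a fresh copy from (simulated) memory on a miss.
            line = CacheLine(value=memory[addr])
            self.lines[addr] = line
        line.remaining = self.threshold
        return line.value, miss

    def remote_update(self, addr: int, value: int) -> bool:
        """Apply an update from another processor; return True if the copy survives."""
        line = self.lines.get(addr)
        if line is None or not line.valid:
            return False  # no valid copy here, nothing to update
        line.remaining -= 1
        if line.remaining <= 0:
            # Too many updates without a local access: drop the copy.
            line.valid = False
            return False
        line.value = value
        return True
```

In this sketch, a copy that is actively read keeps receiving updates and never suffers a coherence miss, while a copy that is no longer used drops out after `threshold` consecutive remote updates. That is the intuition behind trading a bounded amount of extra update traffic for fewer read misses.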
