An adaptive cache coherence protocol optimized for migratory sharing

Parallel programs that use critical sections and run on a shared-memory multiprocessor with a write-invalidate protocol generate invalidation actions that could be eliminated. In this type of sharing, called migratory sharing, each processor typically incurs a cache miss followed by an invalidation request, which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves performance by reducing both the shared-access penalty and the network traffic.
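The merging of the invalidation request into the preceding cache-miss request can be illustrated with a toy model. The following is a minimal sketch, not the protocol from the paper: it tracks a single block, counts only request messages, and uses a hypothetical detection rule (a block is marked migratory when a processor upgrades a shared copy while the only other copy belongs to the previous writer). The names `Block` and `run`, and the omission of any rule for reverting a block to ordinary handling, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Block:
    """Toy directory entry for one memory block (illustrative only)."""
    adaptive: bool                                        # enable the migratory optimization
    state: Dict[int, str] = field(default_factory=dict)   # processor id -> "S" or "M"
    last_writer: Optional[int] = None
    migratory: bool = False
    misses: int = 0                                       # cache-miss requests
    invalidation_requests: int = 0                        # separate upgrade requests

    def read(self, p: int) -> None:
        if p in self.state:
            return                                        # read hit
        self.misses += 1
        if self.adaptive and self.migratory:
            # Merged request: the miss also invalidates the old copy and
            # hands p an exclusive copy, so the later write hits locally.
            self.state = {p: "M"}
            self.last_writer = p
        else:
            # Ordinary write-invalidate: every holder ends up shared.
            self.state = {q: "S" for q in self.state}
            self.state[p] = "S"

    def write(self, p: int) -> None:
        if self.state.get(p) == "M":
            return                                        # write hit on exclusive copy
        if p in self.state:
            # Shared copy: a separate invalidation request is needed.
            self.invalidation_requests += 1
            others = set(self.state) - {p}
            # Hypothetical detection rule (an assumption, see lead-in).
            if self.adaptive and others == {self.last_writer}:
                self.migratory = True
        else:
            self.misses += 1                              # write miss
        self.state = {p: "M"}
        self.last_writer = p


def run(adaptive: bool, n_procs: int, rounds: int):
    """Each processor in turn reads then writes the block (a critical section)."""
    block = Block(adaptive=adaptive)
    for _ in range(rounds):
        for p in range(n_procs):
            block.read(p)
            block.write(p)
    return block.misses, block.invalidation_requests


if __name__ == "__main__":
    print("write-invalidate:", run(False, 4, 2))
    print("adaptive:        ", run(True, 4, 2))
```

In this toy run with four processors and two rounds, the plain write-invalidate model issues 8 misses and 8 invalidation requests, while the adaptive model issues 8 misses and only 2 invalidation requests: the two needed before the block is classified as migratory. The counts say nothing about the benchmark results reported in the paper.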
