A study of three dynamic approaches to handle widely shared data in shared-memory multiprocessors

In this paper we argue that widely shared data are a more serious problem than previously recognized, and that furthermore, it is possible to provide transparent support that actually gives an advantage to accesses to widely shared data by exploiting their redundancy to improve accessibility. The previously proposed GLOW extensions to cache coherence protocols provide such support for widely shared data by defining functionality in the network domain. However, in their static form the GLOW extensions relied on the user to identify and expose widely shared data to the hardware. This approach suffers because: i) it requires modification of the programs, ii) it is not always possible to statically identify the widely shared data, and iii) it is incompatible with commodity hardware. To address these issues, we study three dynamic schemes to discover widely shared data at runtime. The first scheme is inspired by read-combining and is based on observing requests in the network switches (the GLOW agents). The agents intercept requests whose addresses have been observed recently. This scheme closely tracks the performance of the static GLOW while it always outperforms ordinary congestion-based read-combining. In the second scheme, the memory directory discovers widely shared data by counting the number of reads between writes. Information about the widely shared nature of data is distributed to the nodes, which subsequently use special wide sharing requests to access them. Simulations confirm that this scheme works well when the widely shared nature of the data is persistent over time. The third and most significant scheme is based on predicting which load instructions are going to access widely shared data. Although the implementation of this scheme is not as straightforward in a commodity-parts environment, it outperforms all others.
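As a rough illustration of the second scheme, the sketch below models a directory that counts reads between writes and flags a block as widely shared once the count reaches a threshold. It is only a minimal sketch under stated assumptions: the names `Directory`, `DirectoryEntry`, and `WIDE_SHARING_THRESHOLD`, the threshold value, and the choice to keep the flag set after a write are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical threshold: reads between two writes needed to flag a block.
WIDE_SHARING_THRESHOLD = 4


@dataclass
class DirectoryEntry:
    reads_since_write: int = 0   # reads observed since the last write
    widely_shared: bool = False  # set once the block is deemed widely shared


class Directory:
    """Toy memory directory that discovers widely shared blocks at runtime."""

    def __init__(self, threshold: int = WIDE_SHARING_THRESHOLD):
        self.threshold = threshold
        self.entries: dict[int, DirectoryEntry] = {}

    def _entry(self, addr: int) -> DirectoryEntry:
        return self.entries.setdefault(addr, DirectoryEntry())

    def read(self, addr: int) -> bool:
        """Record a read; return True if the block is flagged widely shared."""
        entry = self._entry(addr)
        entry.reads_since_write += 1
        if entry.reads_since_write >= self.threshold:
            entry.widely_shared = True
        return entry.widely_shared

    def write(self, addr: int) -> None:
        """Record a write: the read counter restarts.

        Assumption: the flag is left set, reflecting the expectation that
        wide sharing persists over time; a real design might clear it.
        """
        self._entry(addr).reads_since_write = 0
```

In this toy model, the boolean returned by `read` stands in for the information that would be distributed to the nodes so that they can switch to special wide sharing requests for that block.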
