Cooperative shared memory: software and hardware for scalable multiprocessors

We believe the paucity of massively parallel, shared-memory machines follows from the lack of a shared-memory programming performance model that can inform programmers of the cost of operations (so they can avoid expensive ones) and can tell hardware designers which cases are common (so they can build simple hardware to optimize them). Cooperative shared memory, our approach to shared-memory design, addresses this problem. Our initial implementation of cooperative shared memory uses a simple programming model, called Check-In/Check-Out (CICO), in conjunction with even simpler hardware, called Dir1SW. In CICO, programs bracket uses of shared data with a check_out directive marking the expected first use and a check_in directive terminating the expected use of the data. A cooperative prefetch directive helps hide communication latency. Dir1SW is a minimal directory protocol that adds little complexity to message-passing hardware, but efficiently supports programs written within the CICO model.
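The bracketing idea can be illustrated with a small toy model. The sketch below is not the paper's Dir1SW protocol: the ToyDirectory class, its fast/slow counters, and the run helper are hypothetical names introduced here, and the assumption that a request for a block still checked out by another processor takes a slower path is a simplification made only for illustration.

```python
from typing import Dict, Optional, Tuple


class ToyDirectory:
    """Hypothetical directory: tracks at most one holder per block.

    This is an illustrative assumption, not the Dir1SW state machine.
    """

    def __init__(self) -> None:
        self.owner: Dict[str, Optional[int]] = {}
        self.fast = 0  # check-outs that found the block free (assumed cheap)
        self.slow = 0  # check-outs that hit a block held by another processor

    def check_out(self, proc: int, block: str) -> None:
        """Mark the expected first use of `block` by `proc`."""
        holder = self.owner.get(block)
        if holder is None or holder == proc:
            self.fast += 1
        else:
            self.slow += 1  # conflict: previous user never checked in
        self.owner[block] = proc

    def check_in(self, proc: int, block: str) -> None:
        """Mark the end of the expected use of `block` by `proc`."""
        if self.owner.get(block) == proc:
            self.owner[block] = None


def run(bracketed: bool) -> Tuple[int, int]:
    """Two processors touch the same block in turn; return (fast, slow)."""
    directory = ToyDirectory()
    for proc in (0, 1):
        directory.check_out(proc, "counter")
        # ... the shared data would be read or written here ...
        if bracketed:
            directory.check_in(proc, "counter")
    return directory.fast, directory.slow


if __name__ == "__main__":
    print("with check_in:   ", run(True))
    print("without check_in:", run(False))
```

In this toy model, a program that checks data in after use keeps every later check-out on the assumed cheap path, which is the kind of cost information the abstract says a performance model should give programmers.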
