A Quantitative Analysis of the Performance and Scalability of Distributed Shared Memory Cache Coherence Protocols

Scalable cache coherence protocols have become the key technology for creating moderate to large-scale shared-memory multiprocessors. Although the performance of such multiprocessors depends critically on the performance of the cache coherence protocol, little comparative performance data is available. Existing commercial implementations use a variety of different protocols, including bit-vector/coarse-vector protocols, SCI-based protocols, and COMA protocols. Using the programmable protocol processor of the Stanford FLASH multiprocessor, we provide a detailed, implementation-oriented evaluation of four popular cache coherence protocols. In addition to measurements of the characteristics of protocol execution (e.g., memory overhead, protocol execution time, and message count) and of overall performance, we examine the effects of scaling the processor count from 1 to 128 processors. Surprisingly, the optimal protocol changes for different applications and can change with processor count even within the same application. These results help identify the strengths of specific protocols and illustrate the benefits of providing flexibility in the choice of cache coherence protocol.
