Token Coherence: Low-Latency Coherence on Unordered Interconnects

Future shared-memory multiprocessor servers will target commercial workloads using highly integrated “glueless” designs. Commercial workloads, which exhibit frequent sharing misses, benefit from the direct communication of snooping protocols. Unfortunately, snooping systems require a totally ordered interconnect, which is difficult to implement efficiently in glueless designs. The standard alternative, a directory protocol, is a poor match for commercial workloads because the indirection through the directory increases the latency of common sharing misses. An ideal coherence protocol would have processors communicate directly with one another, without indirection or a fixed ordering point. Such an approach, however, introduces numerous races that are hard to resolve. We propose a new coherence framework that enables such protocols by separating performance from correctness. A performance protocol can optimize for the common case (i.e., the absence of races) and rely on the underlying correctness substrate to provide safety and liveness. We call the combination Token Coherence, since it resolves races using the direct exchange of tokens to control coherence permissions. Token Coherence provides a framework that can support a wide variety of coherence protocols. This paper develops TokenB, a specific performance protocol that uses broadcast, but not snooping, for a 16-processor glueless multiprocessor with a high-bandwidth unordered interconnect. Simulations of commercial workloads (using a detailed memory system and out-of-order processor models) show that our new protocol significantly outperforms both snooping and directory protocols.
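As an illustration of how exchanging tokens can control coherence permissions, the following is a minimal sketch of token counting for a single memory block. It assumes the usual token-counting rule: each block has a fixed number of tokens, a node may read while holding at least one token, and a node may write only while holding all of them. The names `Block`, `TOTAL_TOKENS`, `can_read`, `can_write`, and `transfer` are hypothetical, the choice of 16 tokens is an assumption, and the sketch leaves out data transfer, the correctness substrate's liveness mechanism, and the TokenB broadcast policy.

```python
from dataclasses import dataclass, field

# Assumption: one token per processor in a 16-processor system.
TOTAL_TOKENS = 16


@dataclass
class Block:
    """Token counts for one memory block, keyed by holder name (hypothetical model)."""
    holders: dict = field(default_factory=lambda: {"memory": TOTAL_TOKENS})

    def tokens(self, node):
        return self.holders.get(node, 0)

    def can_read(self, node):
        # Read permission: at least one token.
        return self.tokens(node) >= 1

    def can_write(self, node):
        # Write permission: every token for the block.
        return self.tokens(node) == TOTAL_TOKENS

    def transfer(self, src, dst, count):
        """Move `count` tokens directly from src to dst, with no ordering point."""
        if count < 1 or count > self.tokens(src):
            raise ValueError("cannot send tokens that are not held")
        self.holders[src] = self.tokens(src) - count
        self.holders[dst] = self.tokens(dst) + count
        # Conservation: tokens are moved, never created or destroyed.
        assert sum(self.holders.values()) == TOTAL_TOKENS


block = Block()
block.transfer("memory", "P0", 1)   # P0 becomes a reader
block.transfer("memory", "P1", 1)   # P1 becomes a reader
readers_ok = block.can_read("P0") and block.can_read("P1")
no_writer_yet = not block.can_write("P0") and not block.can_write("P1")

block.transfer("memory", "P0", 14)  # P0 gathers the remaining tokens
block.transfer("P1", "P0", 1)
p0_writes = block.can_write("P0")
p1_locked_out = not block.can_read("P1")
```

Because a writer must hold every token, no other node can hold a token (and therefore read) at the same time, regardless of the order in which messages arrive. That property is what allows the performance protocol to send requests without a totally ordered interconnect.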
