Multicast snooping: a new coherence method using a multicast address network

This paper proposes a new coherence method called "multicast snooping" that dynamically adapts between broadcast snooping and a directory protocol. Multicast snooping is unique because processors predict which caches should snoop each coherence transaction by specifying a multicast "mask." Transactions are delivered over an ordered multicast network, such as an Isotach network, which eliminates the need for acknowledgment messages. Processors handle transactions as they would with a snooping protocol, while a simplified directory operates in parallel to check masks and gracefully handle incorrect ones (e.g., a mask that omits the previous owner). Preliminary results, mostly from SPLASH-2 benchmarks running on 32 processors, show that multicasts can be limited to an average of 2-6 destinations (far fewer than 32) and that the network can deliver 2-5 multicasts per network cycle (versus 1 per cycle for broadcast snooping). Although these results do not include timing, they are encouraging evidence that multicast snooping can obtain data directly (like broadcast snooping) while applying to larger systems (like directories).
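The mask check described above can be illustrated with a small sketch. This is not the paper's implementation: the names `DirectoryEntry`, `required_mask`, and `check_mask` are hypothetical, and the sufficiency rule used here (a mask must cover the requester and the current owner, plus every sharer for an exclusive request) is an assumption chosen to match the "previous owner missing" example in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Request(Enum):
    GET_SHARED = "GETS"      # read request
    GET_EXCLUSIVE = "GETX"   # write request


@dataclass
class DirectoryEntry:
    """Hypothetical simplified directory state for one cache block."""
    owner: Optional[int] = None  # node id of the owning cache, None if memory owns
    sharers: int = 0             # bitmask of nodes holding a shared copy


def node_bit(node: int) -> int:
    """Bit for a single node in a multicast mask."""
    return 1 << node


def required_mask(entry: DirectoryEntry, requester: int, request: Request) -> int:
    """Nodes that must snoop the transaction (assumed rule, see lead-in)."""
    required = node_bit(requester)
    if entry.owner is not None:
        required |= node_bit(entry.owner)
    if request is Request.GET_EXCLUSIVE:
        required |= entry.sharers  # all sharers must see the invalidation
    return required


def check_mask(entry: DirectoryEntry, requester: int,
               request: Request, mask: int) -> Tuple[bool, int]:
    """Return (mask_is_sufficient, bitmask_of_missing_nodes)."""
    missing = required_mask(entry, requester, request) & ~mask
    return missing == 0, missing


def destinations(mask: int) -> List[int]:
    """Expand a mask into the list of node ids it addresses."""
    return [n for n in range(mask.bit_length()) if (mask >> n) & 1]


if __name__ == "__main__":
    entry = DirectoryEntry(owner=5, sharers=node_bit(2) | node_bit(7))
    predicted = node_bit(0)  # requester 0 guessed a mask without the owner
    ok, missing = check_mask(entry, 0, Request.GET_SHARED, predicted)
    print(ok, destinations(missing))
```

In this sketch an insufficient mask is only reported, along with the nodes it missed. How the directory then recovers (for example, by having the requester retry with a corrected mask) is left open, since the abstract does not specify it.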
