Implementing a cache consistency protocol

We present an ownership-based multiprocessor cache consistency protocol, designed for implementation by a single-chip VLSI cache controller. The protocol is compared with other shared bus multiprocessor protocols, and is shown to be an improvement in terms of its additional burden on the system bus. The design has been carried through to layout in a P-Well CMOS technology to illuminate the important implementation issues, the most crucial being the controller critical sections and the inter- and intra-cache interlocks needed to maintain cache consistency.

Key Words and Phrases: Shared Bus Multiprocessor Cache Consistency, Single Chip Implementation, Snooping Caches, Ownership-Based Protocols.

1. Introduction

Shared-memory multiprocessor systems, with caches associated with each processor, require a mechanism for maintaining cache consistency; i.e., all cache entries for the same block of memory must have identical values. For example, if two processors locally cache the same memory location, and one updates the location without informing the other, then an inconsistent state will arise: two processors reading the same location will obtain different results! Our architectural model is simple: a collection of processors, each with its own local cache, connected to each other and to main memory and I/O devices by a single shared system bus. The bus is a critical system resource, and its availability to service processor node requests will have a great effect on system performance. We seek cache consistency solutions that require a small amount of additional work from the system bus. There has been a great deal of recent interest in cache consistency protocols for shared bus multiprocessors (e.g., [ARCH84, FRAN84, GOOD83, MCCR84, PAPA84, RUDO84]). In this paper, we compare several protocols and propose one which is an improvement on these for common cases of shared data.
It has been designed for implementation by an integrated snooping data cache, a single-chip system consisting of (1) a data cache memory, (2) a cache controller that interfaces with the processor, and (3) a "snooping controller" that monitors the system bus. The cache and snoop controllers together implement the protocol. In section 2, the multiprocessor cache consistency problem is described. We present our cache consistency protocol, and compare it with other proposals. Section 3 highlights the aspects of the chip architecture that are relevant to the protocol implementation. The implementation is analyzed for its critical sections in section 4, where we argue that the implementation is race-free. Our summary and conclusions are given in section 5.

¹Research supported by Defense Advanced Research Projects Agency's Strategic Computing Infrastructure Program under the SPUR contract.

2. Multiprocessor Cache Consistency

2.1. Consistency Issues and Protocols

From the programmer's viewpoint, a system with a cache behaves functionally as one without a cache. Therefore, when caches are distributed among multiple processors, the system must ensure that a consistent view of memory is maintained. A read of a block by one processor, following its write by another, must obtain the updated block. One solution is to make uncacheable those portions of memory that are write shareable.² The alternative is to implement a cache consistency protocol which constrains the sequence of processor node reads and writes. Sometimes additional bus operations are generated in order to maintain the consistent view of memory. Early work in this area was based on centralized control [CENS78], but more recent work has focused on distributed protocols for shared bus multiprocessors. In this paper, we focus on protocols for Snooping Data Caches.
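The inconsistency that these protocols prevent can be made concrete with a small sketch. The following illustrative Python model (all names are ours, not the paper's) shows what happens with no consistency mechanism at all: two caches copy the same block, one processor updates its copy locally, and subsequent reads of the same location disagree.

```python
# A minimal sketch of the stale-read problem: caches with no snooping.
memory = {0x100: 7}            # main memory: block address -> value

class Cache:
    """A private per-processor cache with no consistency support."""
    def __init__(self):
        self.entries = {}      # block address -> locally cached value

    def read(self, addr):
        if addr not in self.entries:
            self.entries[addr] = memory[addr]   # fill from memory on a miss
        return self.entries[addr]

    def write(self, addr, value):
        self.read(addr)                 # ensure the block is cached
        self.entries[addr] = value      # update hits only the local copy

p0, p1 = Cache(), Cache()
p0.read(0x100); p1.read(0x100)   # both processors cache the block
p0.write(0x100, 42)              # p0 updates without informing p1
print(p0.read(0x100), p1.read(0x100))   # -> 42 7: inconsistent views
```

Every protocol discussed below adds just enough bus activity to make the two reads at the end agree.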
The Snoop monitors system bus transactions and may manipulate the cache based on these external requests. Its sophistication varies with the protocol it implements. Three kinds of snooping cache protocols have been proposed.

A write-through strategy [AGRA77] writes all cache updates through to the memory system. Caches of other processors on the bus must monitor bus transactions, and invalidate any entries that match when the memory block is written through. A processor's performance may be significantly degraded on writes if the processor must wait until the write is complete.

A second strategy is called write-first [GOOD83, THAC82]. On the first write to a cached entry, the update is written through to memory. This forces other caches to invalidate a matching entry, thus guaranteeing that the writing processor holds the only cached copy. Subsequent writes can be performed in the cache. A processor read will be serviced either by the memory or by a cache, whichever has the most up-to-date version of the block. This protocol is more complicated to implement than write-through, because the Snoop must service external read requests as well as invalidate cache entries. A potential disadvantage of write-first is that it incurs an initial write to memory even when a memory block is not being shared. However, this represents an extra memory write only if there are further processor writes to the memory block.

The third strategy is called ownership (e.g., [FRAN84]). A processor must "own" a block of memory before it is allowed to update it. Ownership is acquired through special read and write operations. By indicating the possibility of modifying a block at the time of its read, the "invalidating" writes exhibited in the above protocols are avoided. However, additional bus transactions may be incurred if the processor does not correctly predeclare its intentions.
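The write-first behavior described above can be sketched in a few lines. This is an illustrative model, not the [GOOD83] design: the state names, the Bus class, and the event interface are our invention, and read servicing is omitted. It captures the two key points: the first write to an entry costs a bus write-through that invalidates other copies, and subsequent writes stay in the cache.

```python
from enum import Enum, auto

class State(Enum):
    INVALID = auto()
    VALID = auto()    # clean copy, consistent with main memory
    DIRTY = auto()    # written at least once; the only cached copy

class Bus:
    """Broadcasts write-throughs to every other cache's snoop."""
    def __init__(self):
        self.caches = []
        self.write_throughs = 0          # memory writes incurred
    def write_through(self, addr, source):
        self.write_throughs += 1
        for c in self.caches:
            if c is not source:
                c.snoop_write(addr)

class WriteFirstCache:
    def __init__(self, bus):
        self.state = {}                  # block address -> State
        self.bus = bus
        bus.caches.append(self)
    def proc_write(self, addr):
        if self.state.get(addr) is not State.DIRTY:
            # First write: through to memory, invalidating other copies.
            self.bus.write_through(addr, source=self)
        self.state[addr] = State.DIRTY   # subsequent writes stay local
    def snoop_write(self, addr):
        if addr in self.state:
            self.state[addr] = State.INVALID   # drop the matching entry

bus = Bus()
c0, c1 = WriteFirstCache(bus), WriteFirstCache(bus)
c0.state[0x200] = c1.state[0x200] = State.VALID  # both hold clean copies
c0.proc_write(0x200)   # first write hits the bus; c1 invalidates
c0.proc_write(0x200)   # second write is purely local: no bus traffic
print(bus.write_throughs)   # -> 1
```

The single counted write-through is exactly the "initial write to memory even when a memory block is not being shared" that the text identifies as write-first's potential disadvantage.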
²Even this may not be sufficient for writable private data if processes are allowed to migrate among processors.

0149-7111/85/0000/0276$01.00 © 1985 IEEE

The protocol to be described falls into this class. A new class of protocols is emerging based on write-broadcast [RUDO84, MCCR84]. In these, a Snoop handles an external write by overwriting its matching cache entry rather than invalidating it. For example, [RUDO84] describes a mixed broadcast/write-first protocol. The first write is broadcast to other caches while being written through to memory. A subsequent write by the same cache generates a bus operation to invalidate other copies. Thus, the caches dynamically distinguish between local blocks (those to which multiple writes are directed by the same processor) and shared blocks (those to which local writes are interleaved with external reads and writes). The protocol improves on write-first for shared variables, but makes multiple writes to non-shared variables even more expensive.

Some protocol issues are independent of the strategy employed. These are: (1) whether the memory controller is "dumb" or "smart": either memory is inhibited by the responding cache or the memory controller knows whether to respond with the requested block, and (2) whether shared blocks are returned to global memory or kept in the caches: e.g., both [GOOD83] (write-first) and [FRAN84] (ownership) require that the cache return a block to main memory after it has been requested by another processor. Main memory then provides the block to the requesting processor. The Berkeley Protocol, described below, implements an ownership strategy with owning caches inhibiting memory and owned blocks being kept in the cache.

2.2. The Berkeley Ownership Protocol: States and Operations

The Berkeley Protocol is designed for shared bus multiprocessors.
In its design, we had the following objectives and constraints: (1) minimize the number of additional bus actions needed to maintain consistency, thus making data sharing reasonably cheap, (2) avoid memory system design, so that commercially available memory boards could be used, and (3) avoid backplane design, although additional signals could be added to an existing backplane and bus protocol to support special communications among the caches.

Before presenting the protocol's states and operations, we first define some terms. A block is a logical unit of memory consisting of one or a small number of words. It is identified by its address, and is the unit of transfer between main memory and the caches. Copies of a block's contents can simultaneously reside in main memory and/or in several of the cache memories. A cache entry is a physical slot within cache memory that consists of a data portion, a tag, and a state. It is analogous to a page frame in a virtual memory system. The data portion holds the cached copy of a memory block. The tag is the portion of the block address that is used by the cache's address mapping algorithm to determine whether the block is in the cache. Since different blocks are mapped to the same entry, the tag distinguishes among these. The state encodes the state of the data portion of the cache entry. For the Berkeley Protocol, the possible states are: Invalid, UnOwned, Owned Exclusively, or Owned NonExclusively.³

Copies of a memory block can reside in more than one cache. At most one cache owns the block, and the owner is the only cache allowed to update it. Owning a block also obligates the owner to provide the data to other requesting caches and to update main memory when the block is replaced in the cache. If the state is Owned Exclusively, the owning cache holds the only cached copy of the block. Updates can occur locally, without informing the other caches.
A state of Owned NonExclusively implies that other caches have copies and must be informed about updates to the block. The UnOwned state carries neither ownership rights nor responsibilities for a cache. In this case, several caches may have copies of the block. A brief summary of the implications of each state is given in Table 2.1. A cache entry's state can change in response to a system bus or processor operation that affects its validity or exclusiveness.
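The state descriptions above can be summarized as a transition sketch. This is a hedged, simplified rendering of what the text specifies (writes require ownership, an owner supplies the block to external readers, Owned NonExclusively means other copies exist); the event names and functions are ours, replacement and flush handling are omitted, and the authoritative summary is the paper's Table 2.1.

```python
from enum import Enum, auto

class St(Enum):
    INVALID = auto()
    UNOWNED = auto()         # valid copy; no ownership rights or duties
    OWNED_EXCL = auto()      # Owned Exclusively: the only cached copy
    OWNED_NONEXCL = auto()   # Owned NonExclusively: other copies exist

def on_proc_write(state, bus_ops):
    """Processor write: the cache must hold exclusive ownership first.
    Unless already Owned Exclusively, a bus action invalidates other
    copies; afterwards updates proceed locally."""
    if state is not St.OWNED_EXCL:
        bus_ops.append('invalidate-others')
    return St.OWNED_EXCL

def on_snoop_read(state, bus_ops):
    """External read seen on the bus: the owner, not memory, supplies
    the block, and its copy is no longer the only cached one."""
    if state in (St.OWNED_EXCL, St.OWNED_NONEXCL):
        bus_ops.append('supply-block')
        return St.OWNED_NONEXCL
    return state

def on_snoop_invalidate(state):
    """Another cache acquires ownership: drop any matching copy."""
    return St.INVALID

ops = []
s = St.UNOWNED
s = on_proc_write(s, ops)   # acquire ownership -> Owned Exclusively
s = on_snoop_read(s, ops)   # another cache reads -> Owned NonExclusively
s = on_proc_write(s, ops)   # must invalidate the other copies again
print(s, ops)
```

Note how the second processor write from Owned NonExclusively regenerates a bus action: that is the "must be informed about updates" obligation of the NonExclusive owner in the text.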