We describe the memory system design of a tightly-coupled multiprocessor. Each processor node consists of a VLSI RISC processor, a VLSI cache controller, cache data RAMs, and a standard bus. We show that adequate performance can be achieved only if the processor has an on-chip instruction buffer and a large (64KB-256KB) local instruction and data cache. Ways of reducing the number of cache tags and the effects of various implementation alternatives for where to perform virtual memory translation are also described.