Switch cache: a framework for improving the remote memory access latency of CC-NUMA multiprocessors

Cache-coherent non-uniform memory access (CC-NUMA) multiprocessors continue to suffer from high remote memory access latencies, caused by comparatively slow memory technology and by data transfer latencies in the interconnection network. We propose a novel hardware caching technique called the switch cache. The main idea is to implement small, fast caches in the crossbar switches of the interconnect to capture and store shared data as it flows from the memory module to the requesting processor. The stored data then serves subsequent requests for the same blocks, reducing the latency of remote memory accesses. The implementation of a cache in a crossbar switch must be efficient and robust, yet flexible enough to accommodate changes in the caching protocol. We present the design and implementation details of a CAche Embedded Switch ARchitecture, CAESAR, that uses wormhole routing with virtual channels. Using detailed execution-driven simulations, we find that the CAESAR switch cache improves the performance of CC-NUMA multiprocessors by reducing the number of reads served at distant remote memories by up to 45% and improving application execution time by as much as 20%. We conclude that switch caches provide a cost-effective solution for designing high-performance CC-NUMA multiprocessors.
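
The core idea can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the CAESAR hardware design: the class and function names, the LRU replacement policy, the capacity of four blocks, and the write-invalidate step are all hypothetical choices made for illustration. It shows a read reply being captured by every switch on its way back to the requester, so that a later read of the same block is served at the first switch instead of at the remote memory.

```python
from collections import OrderedDict


class SwitchCache:
    """Toy model of a small cache inside one switch (LRU policy is an assumption)."""

    def __init__(self, capacity=4):
        self.capacity = capacity      # number of blocks; hypothetical size
        self.blocks = OrderedDict()   # block address -> data, oldest first

    def lookup(self, addr):
        """Return cached data for addr, or None on a miss."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)  # mark as most recently used
            return self.blocks[addr]
        return None

    def fill(self, addr, data):
        """Capture a block passing through the switch, evicting the LRU block if full."""
        self.blocks[addr] = data
        self.blocks.move_to_end(addr)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def invalidate(self, addr):
        """Drop a block, e.g. when another processor writes it (assumed protocol step)."""
        self.blocks.pop(addr, None)


def remote_read(addr, path, memory):
    """Issue a read along `path`, a list of switches ordered from requester to memory.

    Returns (data, hops). hops is the number of switches the request visited,
    or len(path) + 1 when the read had to be served by the remote memory.
    """
    for hops, switch in enumerate(path, start=1):
        data = switch.lookup(addr)
        if data is not None:
            # Hit in a switch: the reply turns around here and is captured
            # by the switches that lie closer to the requester.
            for nearer in path[:hops - 1]:
                nearer.fill(addr, data)
            return data, hops
    # Miss everywhere: memory serves the read, and every switch on the
    # return path captures the block as it flows back.
    data = memory[addr]
    for switch in path:
        switch.fill(addr, data)
    return data, len(path) + 1


def remote_write(addr, data, path, memory):
    """Write through to memory and invalidate stale copies on the path (assumption)."""
    memory[addr] = data
    for switch in path:
        switch.invalidate(addr)
```

With a two-switch path and `memory = {0x40: "A"}`, the first call to `remote_read(0x40, path, memory)` returns `("A", 3)` because the request travels all the way to memory, while a second call returns `("A", 1)` because the first switch now holds the block. A real switch cache must also keep these copies coherent with the directory protocol; the invalidation in `remote_write` is only a placeholder for that mechanism.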
