The effectiveness of SRAM network caches in clustered DSMs

The frequency of accesses to remote data is a key factor affecting the performance of all Distributed Shared Memory (DSM) systems. Remote data caching is one of the most effective and general techniques to fight processor stalls due to remote capacity misses in the processor caches. The design space of remote data caches (RDC) has many dimensions and one essential performance trade-off hit ratio versus speed. Some recent commercial systems have opted for large and slow (S)DRAM network caches (NC), but others completely avoid them because of their damaging effects on the remote/local latency ratio. In this paper we will explore small and fast SRAM network caches as a means to reduce the remote stalls and capacity traffic of multiprocessor clusters. The major appeal of SRAM NCs is that they add less penalty on the latency of NC hits and remote accesses. Their small capacity can handle conflict misses and a limited amount of capacity misses. However, they can be coupled with main memory page caches which satisfy the bulk of capacity misses. To maximize performance for a large spectrum of applications, we propose to organize the NC as a victim cache for remote data. We also propose a novel and scalable method to control the page cache by integrating page relocation mechanisms into the network victim cache.

[1]  Michel Dubois,et al.  Dynamic page migration in multiprocessors with distributed global memory , 1988, [1988] Proceedings. The 8th International Conference on Distributed.

[2]  Josep Torrellas,et al.  Reducing remote conflict misses: NUMA with remote cache versus COMA , 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[3]  Janak H. Patel,et al.  A low-overhead coherence solution for multiprocessors with private cache memories , 1984, ISCA '84.

[4]  Erik Hagersten,et al.  DDM - A Cache-Only Memory Architecture , 1992, Computer.

[5]  Truman Joe COMA-F: a non-hierarchical cache only memory architecture , 1995 .

[6]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[7]  John B. Carter,et al.  An argument for simple COMA , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.

[8]  Norman P. Jouppi,et al.  Tradeoffs in two-level on-chip caching , 1994, ISCA '94.

[9]  T. Wicki,et al.  The Mercury Interconnect Architecture: A Cost-effective Infrastructure For High-performance Servers , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[10]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[11]  James R. Larus,et al.  Tempest and typhoon: user-level shared memory , 1994, ISCA '94.

[12]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[13]  Wen-Hann Wang,et al.  On the Inclusion Properties for Multi-Level Cache Hierarchies , 1988, ISCA.

[14]  Thorsten von Eicken,et al.  技術解説 IEEE Computer , 1999 .

[15]  D.A. Wood,et al.  Reactive NUMA: A Design For Unifying S-COMA And CC-NUMA , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[16]  T. Lovett,et al.  STiNG: A CC-NUMA Computer System for the Commercial Marketplace , 1996, 23rd Annual International Symposium on Computer Architecture (ISCA'96).

[17]  Anders Landin,et al.  Bus-based COMA-reducing traffic in shared-bus multiprocessors , 1996, Proceedings. Second International Symposium on High-Performance Computer Architecture.

[18]  Anoop Gupta,et al.  The Stanford Dash multiprocessor , 1992, Computer.

[19]  Anoop Gupta,et al.  Operating system support for improving data locality on CC-NUMA compute servers , 1996, ASPLOS VII.

[20]  Jinseok Kong,et al.  Relaxing the Inclusion Property in Cache Only Memory Architecture , 1996, Euro-Par, Vol. II.

[21]  D. Lenoski,et al.  The SGI Origin: A ccnuma Highly Scalable Server , 1997, Conference Proceedings. The 24th Annual International Symposium on Computer Architecture.

[22]  Michael C. Browne,et al.  The S3.mp scalable shared memory multiprocessor , 1994, 1994 Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.

[23]  Katherine E. Fletcher,et al.  Techniques For Reducing the Impact of Inclusion in Shared Network Cache Multiprocessors , 1994 .

[24]  Ricardo Bianchini,et al.  Using simple page placement policies to reduce the cost of cache fills in coherent shared-memory systems , 1995, Proceedings of 9th International Parallel Processing Symposium.

[25]  Carla Schlatter Ellis,et al.  Experimental comparison of memory management policies for NUMA multiprocessors , 1991, TOCS.