Switch MSHR: A Technique to Reduce Remote Read Memory Access Time in CC-NUMA Multiprocessors

Remote memory access poses a severe problem for the design of CC-NUMA multiprocessors because it takes an order of magnitude longer than a local memory access. The large latency arises partly from the increased distance between the processor and the remote memory over the interconnection network. In this paper, we develop a new switch architecture, called Switch MSHR (SMSHR), which provides the cache block to requesting processors without their requests having to travel to the home memory. The SMSHR idea is to provide a few miss status holding registers (MSHRs) in each switch that keep track of read requests to the memory. The SMSHR blocks secondary requests to the same memory block at the switch and provides them with a copy of the block when the reply to the primary request returns. The SMSHR design is then extended to include a switch cache, which can temporarily save a copy of the data block for later use. We provide basic block designs for the SMSHR and SMSHR+cache architectures in this paper. We explore the design space by modeling the new switch architectures in a detailed execution-driven simulator and analyze the performance benefits. Our simulation results show that applications with a high degree of data sharing benefit substantially from the SMSHR and SMSHR+cache techniques.
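
The request-merging behavior described above can be illustrated with a minimal sketch. It is not the design from the paper: the class name `SwitchMSHR`, its methods, the default of four entries, and the policy of forwarding untracked requests when no MSHR is free are all assumptions made for illustration. The switch cache extension and coherence actions such as invalidations are not modeled.

```python
class SwitchMSHR:
    """Toy model of one switch holding a few MSHRs for read requests.

    Hypothetical sketch: names, sizes, and the overflow policy are
    assumptions, not taken from the paper.
    """

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        # block address -> processors whose secondary requests are blocked here
        self.entries = {}

    def on_read_request(self, block_addr, requester):
        """Return True if the request must be forwarded toward home memory."""
        if block_addr in self.entries:
            # Secondary request: hold it at the switch instead of forwarding.
            self.entries[block_addr].append(requester)
            return False
        if len(self.entries) < self.num_entries:
            # Primary request: allocate an MSHR, then forward the request.
            self.entries[block_addr] = []
        # Primary request, or no free MSHR (assumed policy: forward untracked).
        return True

    def on_read_reply(self, block_addr, data, dest):
        """Return the (processor, data) deliveries triggered by a reply."""
        # Free the MSHR, if any, and collect the blocked secondary requesters.
        waiters = self.entries.pop(block_addr, [])
        # The reply continues to its original destination, and each blocked
        # requester receives a copy of the same block.
        return [(dest, data)] + [(p, data) for p in waiters if p != dest]
```

In this sketch, a second read to a block that already has an outstanding request at the switch never reaches the home memory; it is served when the first reply passes back through the switch.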
