A comparative evaluation of hybrid distributed shared-memory systems

Distributed shared-memory (DSM) systems are shared-memory multiprocessor architectures in which each processor node contains a partition of the shared memory. In hybrid DSM systems, coherence among caches is maintained by a software-implemented coherence protocol that relies on some hardware support. The hardware satisfies every node hit (the common case), and software is invoked only for accesses to remote nodes. In this paper we compare the design and performance of four hybrid DSM organizations by detailed simulation of the same hardware platform. We have implemented the software protocol handlers for the four architectures in C and assembly code; coherence transactions are executed in trap and interrupt handlers. Together with the application, the handlers are executed in full detail in execution-driven simulations of six complete benchmarks with coarse-grain and fine-grain sharing. We relate our experience implementing and simulating the software protocols for the four architectures. Because the overhead of remote accesses is very high in hybrid systems, the system of choice differs from that for purely hardware systems.
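
The split between the hardware path and the software path can be illustrated with a minimal sketch in C. It is not one of the four protocols evaluated in the paper. The names `dsm_sim`, `dsm_init`, `dsm_access`, `remote_access_handler` and `home_node`, the node and block counts, and the block-partitioned home mapping are all assumptions made for illustration, and the sketch models no invalidations or replacements. An access to a block already present in the node counts as a node hit; any other access calls a stand-in for the software handler, which records a local copy so that later accesses hit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only for illustration. */
enum { NUM_NODES = 4, BLOCKS_PER_NODE = 8, NUM_BLOCKS = NUM_NODES * BLOCKS_PER_NODE };

typedef struct {
    unsigned char present[NUM_NODES][NUM_BLOCKS]; /* 1 if the node holds the block */
    unsigned long node_hits;      /* accesses satisfied on the hardware path */
    unsigned long handler_calls;  /* accesses that invoked the software path */
} dsm_sim;

/* Home node of a block under a simple block partition (an assumption). */
static int home_node(size_t block) {
    assert(block < (size_t)NUM_BLOCKS);
    return (int)(block / BLOCKS_PER_NODE);
}

/* Initially each node holds only the blocks of its own partition. */
static void dsm_init(dsm_sim *s) {
    memset(s, 0, sizeof *s);
    for (size_t b = 0; b < (size_t)NUM_BLOCKS; b++)
        s->present[home_node(b)][b] = 1;
}

/* Stand-in for a software coherence handler run from a trap:
 * it only counts the call and records a local copy of the block. */
static void remote_access_handler(dsm_sim *s, int node, size_t block) {
    s->handler_calls++;
    s->present[node][block] = 1;
}

/* Returns 1 for a node hit (hardware path), 0 if software was invoked. */
static int dsm_access(dsm_sim *s, int node, size_t block) {
    assert(node >= 0 && node < NUM_NODES);
    assert(block < (size_t)NUM_BLOCKS);
    if (s->present[node][block]) {
        s->node_hits++;
        return 1;
    }
    remote_access_handler(s, node, block);
    return 0;
}
```

In this toy model, the first access by node 0 to a block homed on node 1 goes through the handler, and a repeated access to the same block is a node hit. The ratio of `handler_calls` to `node_hits` is the kind of quantity that determines how much the high remote-access overhead matters.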
