Design and performance of the Software-Controlled COMA

Traditionally, cache coherence in multiprocessors has been maintained in hardware. However, the cost-effectiveness of hardware protocols for Distributed Shared Memory (DSM) systems is questionable. Virtual Shared Memory systems have highlighted the many advantages of software-implemented protocols, albeit at a performance price. Hybrid systems, which combine software-implemented coherence protocols with hardware support for fine-grain access control, narrow this performance gap. This work contains the first proposal and evaluation of a hybrid COMA (Cache-Only Memory Architecture). The system is called SC-COMA, for Software-Controlled COMA, to emphasize that the protocol engine is emulated by software executed on the main processor. In contrast to user-level protocols, the software handling coherence events in SC-COMA runs in sub-kernel mode, transparently and efficiently providing the same services to applications as a hardware counterpart. SC-COMA employs a novel coherence protocol, optimized for a hybrid implementation, which has been fully implemented. The support for fine-grain access control is embedded in the memory controller. The evaluation methodology is based on execution-driven simulation of complete applications from the SPLASH-2 suite. Results show that SC-COMA is competitive and a viable way to easily transform networks of workstations into powerful multiprocessors. On systems with 32 processors, it incurs a slowdown of 11-56% relative to an aggressive hardware counterpart across a range of applications and memory overheads. Scalability is good, and faster processors affect performance favorably. An investigation of the impact of memory organization on the performance of hybrid systems reveals that, in most of a wide range of cases, COMA outperforms the alternatives (CC-NUMA, Simple COMA, and RC-NUMA) because of its lower node miss ratio.
The performance of SC-COMA is further improved by three techniques: relaxed inclusion, mastership hints, and replacement hints. Even more significant improvements are obtained by adapting the SC-COMA approach to other hardware platforms: symmetric multiprocessor (SMP) nodes and processors with non-blocking stores.
