Optimizing Software Cache-coherent Cluster Architectures

Software cache-coherent systems using programmable protocol processors provide a flexible infrastructure for expanding a system in size and function. However, this flexibility comes at a cost in performance. First, a software implementation of a protocol is inherently slower than a hardware implementation. Second, when multiple processors share a protocol processor, contention may substantially increase memory latency. In this paper, we study how the overhead of a software scheme can be reduced in the context of a shared-memory system consisting of SMP clusters. We study various design choices, including hardware assists such as forwarding logic in the protocol processor and software hints given through explicit communication primitives. We conduct our experiments via trace-driven simulation and compare the execution of three programs from the SPLASH-2 suite. We find that small cluster sizes (up to 4 processors per node) work well for both hardware and software implementations. When the forwarding logic is incorporated into the software scheme, its performance is competitive with that of the hardware scheme. When enhanced further by explicit communication primitives, the software scheme can perform even better than a pure hardware implementation. This is particularly noticeable when the network latency is high.
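The contention effect described above can be illustrated with a rough queueing sketch. This is not the paper's trace-driven simulator: it assumes the shared protocol processor behaves like a simple M/M/1 queue, and the function name, request rate, and service times are all hypothetical values chosen only for illustration.

```python
def mean_response_time(n_procs, req_rate, service_time):
    """Mean time a request spends at a shared protocol processor.

    Assumption (not from the paper): the protocol processor is an M/M/1
    queue. Each of n_procs processors issues requests at req_rate
    (requests per cycle), and each request occupies the protocol
    processor for service_time cycles on average.
    """
    utilization = n_procs * req_rate * service_time
    if utilization >= 1.0:
        # The queue is saturated; latency grows without bound.
        return float("inf")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical numbers: a software handler is assumed to take longer
    # per request than a hardwired controller.
    REQ_RATE = 0.001        # requests per cycle per processor (assumed)
    HW_SERVICE = 20.0       # cycles per request, hardware (assumed)
    SW_SERVICE = 100.0      # cycles per request, software (assumed)

    for n in (1, 2, 4, 8):
        hw = mean_response_time(n, REQ_RATE, HW_SERVICE)
        sw = mean_response_time(n, REQ_RATE, SW_SERVICE)
        print(f"{n} procs/node: hardware {hw:.1f} cycles, software {sw:.1f} cycles")
```

Under these assumed parameters, the software latency rises faster with cluster size than the hardware latency, because the longer per-request occupancy pushes the shared protocol processor closer to saturation.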
