On the use and performance of explicit communication primitives in cache-coherent multiprocessor systems

Recent shared-memory multiprocessor designs advocate using off-the-shelf hardware to provide basic communication mechanisms and software to implement cache coherence policies. Exposing the communication mechanisms to software opens many opportunities for enhancing application performance. In this paper we propose a set of communication primitives, implemented on a communication co-processor, that introduce a flavor of message passing and permit protocol optimization. To assess the overhead of implementing the primitives and protocols in software, we compare four configurations: a PRAM model, a hardware cache coherence scheme, a software scheme implementing only the basic cache coherence protocol, and an optimized software solution that supports the additional communication primitives and runs applications annotated with those primitives. With the parameters we chose for the communication processor, the overall memory system overhead of the basic software scheme is at least 50% higher than that of the hardware implementation. With appropriate insertion of the communication primitives, the optimized software solution achieves performance comparable to that of the hardware scheme.
