Optimizing communication in HPF programs on fine-grain distributed shared memory

Unlike compiler-generated message-passing code, the coherence mechanisms in shared-memory systems work equally well for regular and irregular programs. In many programs, however, compile-time information about data accesses would permit data to be transferred more efficiently, provided the underlying shared-memory system offered suitable primitives. This paper demonstrates that cooperation between a compiler and a memory coherence protocol can improve the performance of High Performance Fortran (HPF) programs running on a fine-grain distributed shared memory system by up to a factor of 2, while retaining the versatility and portability of shared memory. As a consequence, shared memory's performance becomes competitive with message passing for regular applications, while preserving (or, in some cases, even improving) its large advantage for irregular codes. This paper describes the design of our implementation and reports experimental results.
