Communication optimizations for fine-grained UPC applications

Global address space languages like UPC exhibit high performance and portability on a broad class of shared and distributed memory parallel architectures. The most scalable applications use bulk memory copies rather than individual reads and writes to the shared space, but finer-grained sharing can be useful for scenarios such as dynamic load balancing, event signaling, and distributed hash tables. In this paper we present three optimization techniques for global address space programs with fine-grained communication: redundancy elimination, use of split-phase communication, and communication coalescing. Parallel UPC programs are analyzed using static single assignment form and a dataflow graph, both of which are extended to handle the various shared and private pointer types available in UPC. The optimizations also take advantage of UPC's relaxed memory consistency model, which reduces the need for cross-thread analysis. We demonstrate the effectiveness of the analysis and optimizations using several benchmarks, chosen to reflect the kinds of fine-grained, communication-intensive phases that exist in some larger applications. The optimizations show speedups of up to 70% on three parallel systems, which represent three different types of cluster network technologies.
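To make two of the three techniques concrete, the following is a minimal sketch in plain C rather than UPC. It is not the paper's implementation: the "remote" memory is an ordinary array, and the names remote_get, remote_get_bulk, and messages_sent are hypothetical stand-ins that only count simulated network messages. Split-phase communication is not modeled, since it requires an asynchronous runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for another thread's shared memory. */
enum { REMOTE_WORDS = 64 };
static int remote_mem[REMOTE_WORDS];
static int messages_sent; /* number of simulated network messages */

/* Fill the simulated remote memory with 0, 1, 2, ... and reset the counter. */
static void remote_init(void) {
    for (size_t i = 0; i < REMOTE_WORDS; i++) remote_mem[i] = (int)i;
    messages_sent = 0;
}

/* Simulated fine-grained read: one message per element. */
static int remote_get(size_t idx) {
    assert(idx < REMOTE_WORDS);
    messages_sent++;
    return remote_mem[idx];
}

/* Simulated bulk read: one message for a contiguous range. */
static void remote_get_bulk(int *dst, size_t first, size_t count) {
    assert(first + count <= REMOTE_WORDS);
    messages_sent++;
    memcpy(dst, &remote_mem[first], count * sizeof(int));
}

/* Unoptimized: each element is fetched twice per iteration. */
static int sum_squares_naive(size_t first, size_t count) {
    int sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += remote_get(first + i) * remote_get(first + i);
    return sum;
}

/* Redundancy elimination: the second fetch reuses a private copy. */
static int sum_squares_no_redundancy(size_t first, size_t count) {
    int sum = 0;
    for (size_t i = 0; i < count; i++) {
        int v = remote_get(first + i);
        sum += v * v;
    }
    return sum;
}

/* Coalescing: the per-element fetches become one bulk transfer. */
static int sum_squares_coalesced(size_t first, size_t count) {
    int buf[REMOTE_WORDS];
    int sum = 0;
    remote_get_bulk(buf, first, count);
    for (size_t i = 0; i < count; i++)
        sum += buf[i] * buf[i];
    return sum;
}
```

Calling remote_init() and then each function over the same eight-element range gives the same sum with 16, 8, and 1 simulated messages respectively. Reusing the private copy is safe here only because nothing else writes the simulated memory; in a real UPC program that reuse has to be justified against the memory consistency model.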
