Global communication analysis and optimization

Reducing communication cost is crucial to achieving good performance on scalable parallel machines. This paper presents a new compiler algorithm for global analysis and optimization of communication in data-parallel programs. Our algorithm differs from existing approaches in that, rather than handling loop nests and array references one by one, it considers all communication in a procedure, and the interactions among the communications under different placements, before making a final decision on the placement of any communication. It exploits the flexibility resulting from this global analysis to eliminate redundancy, reduce the number of messages, and reduce contention for cache and communication buffers, all in a unified framework. In contrast, single loop-nest analysis often retains redundant communication, and more aggressive dataflow analysis on array sections can generate too many messages or cause cache and buffer contention. The algorithm has been implemented in the IBM pHPF compiler for High Performance Fortran. During compilation, the number of messages per processor goes down by as much as a factor of nine for some HPF programs. We present performance results for the IBM SP2 and a network of Sparc workstations (NOW) connected by a Myrinet switch. In many cases, the communication cost is reduced by a factor of two.
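The idea of deferring placement until all communication in a procedure has been examined can be illustrated with a toy sketch. It is not the algorithm implemented in pHPF; it only assumes that each communication carries a range of legal program points (hypothetical fields `earliest` and `latest`) and a processor mapping (hypothetical field `pattern`), and it uses the classic greedy interval-stabbing technique to pick shared placement points. Communications with the same pattern that land on the same point are combined into one message, and a repeated transfer of the same array section at that point is kept only once.

```python
from collections import defaultdict
from typing import NamedTuple


class Comm(NamedTuple):
    """One communication requirement (all fields are illustrative)."""
    array: str       # array whose data must be moved
    section: tuple   # (low, high) inclusive index range of the section
    pattern: str     # processor mapping, e.g. "shift+1"
    earliest: int    # first program point at which the data is available
    latest: int      # last program point before the data is used


def place(comms):
    """Choose one placement point per communication.

    Returns a dict mapping (pattern, point) to the set of
    (array, section) pairs carried by the single message sent there.
    """
    by_pattern = defaultdict(list)
    for c in comms:
        by_pattern[c.pattern].append(c)

    plan = defaultdict(set)
    for pattern, group in by_pattern.items():
        point = None
        # Greedy interval stabbing: visit ranges by increasing `latest`
        # and open a new point only when the current one is out of range.
        for c in sorted(group, key=lambda c: c.latest):
            if point is None or point < c.earliest:
                point = c.latest
            # The set drops a duplicate transfer of the same section.
            plan[(pattern, point)].add((c.array, c.section))
    return dict(plan)


comms = [
    Comm("a", (1, 100), "shift+1", 0, 5),
    Comm("b", (1, 100), "shift+1", 3, 8),
    Comm("a", (1, 100), "shift+1", 4, 9),   # same data requested again
    Comm("c", (1, 100), "shift-1", 2, 6),
]
plan = place(comms)
for (pattern, point), payload in sorted(plan.items()):
    print(pattern, point, sorted(payload))
```

In this example the four requirements collapse into two messages: the three `shift+1` ranges share program point 5, and the second request for `a` adds nothing to the payload. Placing each communication independently at its own latest point would instead produce four messages. The sketch ignores the cache and buffer contention that the paper also takes into account.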