Optimizing irregular shared-memory applications for distributed-memory systems

In prior work, we have proposed techniques to extend the ease of shared-memory parallel programming to distributed-memory platforms by automatic translation of OpenMP programs to MPI. In the case of irregular applications, the performance of this translation scheme is limited by the fact that accesses to shared-data cannot be accurately resolved at compile-time. Additionally, irregular applications with high communication to computation ratios pose challenges even for direct implementation on message passing systems. In this paper, we present combined compile-time/run-time techniques for optimizing irregular shared-memory applications on message passing systems in the context of automatic translation from OpenMP to MPI. Our transformations enable computation-communication overlap by restructuring irregular parallel loops. The compiler creates inspectors to analyze actual data access patterns for irregular accesses at runtime. This analysis is combined with the compile-time analysis of regular data accesses to determine which iterations of irregular loops access non-local data. The iterations are then reordered to enable computation-communication overlap. In the case where the irregular access occurs inside nested loops, the loop nest is restructured. We evaluate our techniques by translating OpenMP versions of three benchmarks from two important classes of irregular applications - sparse matrix computations and molecular dynamics. We find that for these applications, on sixteen nodes, versions employing computation-communication overlap are almost twice as fast as baseline OpenMP-to-MPI versions, almost 30% faster than inspector-only versions, almost 25% faster than hand-coded versions on two applications and about 9% slower on the third.

[1]  Openmp: a Proposed Industry Standard Api for Shared Memory Programming , 2022 .

[2]  Ken Kennedy,et al.  Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.

[3]  Chau-Wen Tseng,et al.  Compiler optimizations for improving data locality , 1994, ASPLOS VI.

[4]  Frederica Darema,et al.  A single-program-multiple-data computational model for EPEX/FORTRAN , 1988, Parallel Comput..

[5]  Katherine Yelick,et al.  Titanium Language Reference Manual , 2001 .

[6]  Larry Carter,et al.  Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).

[7]  Rudolf Eigenmann,et al.  Towards automatic translation of OpenMP to MPI , 2005, ICS '05.

[8]  David F. Heidel,et al.  An Overview of the BlueGene/L Supercomputer , 2002, ACM/IEEE SC 2002 Conference (SC'02).

[9]  Chau-Wen Tseng,et al.  A Comparison of Locality Transformations for Irregular Codes , 2000, LCR.

[10]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[11]  Harry Berryman,et al.  Distributed Memory Compiler Design for Sparse Problems , 1995, IEEE Trans. Computers.

[12]  Rice UniversityCORPORATE,et al.  High performance Fortran language specification , 1993 .

[13]  Rudolf Eigenmann,et al.  SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance , 2001, WOMPAT.

[14]  Hironori Kasahara,et al.  Data-localization for Fortran macro-dataflow computation using partial static task assignment , 1996, ICS '96.

[15]  M. Karplus,et al.  CHARMM: A program for macromolecular energy, minimization, and dynamics calculations , 1983 .

[16]  Rudolf Eigenmann,et al.  Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation , 2003, LCPC.

[17]  Jimmy Su,et al.  Automatic support for irregular computations in a high-level language , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[18]  Joel H. Saltz,et al.  ICASE Report No . 92-12 / iVG / / ff 3 J / ICASE THE DESIGN AND IMPLEMENTATION OF A PARALLEL UNSTRUCTURED EULER SOLVER USING SOFTWARE PRIMITIVES , 2022 .

[19]  Joel H. Saltz,et al.  Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures , 1994, J. Parallel Distributed Comput..

[20]  Gagan Agrawal,et al.  Porting and performance evaluation of irregular codes using OpenMP , 2000, Concurr. Pract. Exp..

[21]  Rudolf Eigenmann,et al.  Optimizing OpenMP Programs on Software Distributed Shared Memory Systems , 2004, International Journal of Parallel Programming.

[22]  José E. Moreira,et al.  An Overview Of The Bluegene/L System Software Organization , 2003, Parallel Process. Lett..

[23]  Ken Kennedy,et al.  Loop distribution with arbitrary control flow , 1990, Proceedings SUPERCOMPUTING '90.

[24]  Greg Burns,et al.  LAM: An Open Cluster Environment for MPI , 2002 .

[25]  Alan L. Cox,et al.  Compiler and software distributed shared memory support for irregular applications , 1997, PPOPP '97.

[26]  G. Liu,et al.  Overlap of Computation and Communication on Shared-Memory , 1999, Scalable Comput. Pract. Exp..

[27]  Dimitri J. Mavriplis,et al.  The design and implementation of a parallel unstructured Euler solver using software primitives , 1992 .

[28]  Ken Kennedy,et al.  An Implementation of Interprocedural Bounded Regular Section Analysis , 1991, IEEE Trans. Parallel Distributed Syst..

[29]  David A. Padua,et al.  Array privatization for shared and distributed memory machines (extended abstract) , 1993, SIGP.

[30]  Edith Schonberg,et al.  An HPF Compiler for the IBM SP2 , 1995, Proceedings of the IEEE/ACM SC95 Conference.

[31]  Katherine A. Yelick,et al.  Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY , 2001, International Conference on Computational Science.

[32]  Toshio Nakatani,et al.  A Loop Transformation Algorithm for Communication Overlapping , 2004, International Journal of Parallel Programming.

[33]  Horst D. Simon,et al.  Partitioning of unstructured problems for parallel processing , 1991 .

[34]  Matt W. Mutka,et al.  Enabling unimodular transformations , 1994, Proceedings of Supercomputing '94.

[35]  SkjellumAnthony,et al.  A high-performance, portable implementation of the MPI message passing interface standard , 1996 .

[36]  Prithviraj Banerjee,et al.  Techniques to overlap computation and communication in irregular iterative applications , 1994, ICS '94.