Optimizing Explicitly Parallel Programs

We present compiler optimization techniques for explicitly parallel programs that communicate through a shared address space. The source programs are written in a single program multiple data (SPMD) style, and our target machine is a multiprocessor with physically distributed memory and hardware or software support for a single address space. The source language provides ordinary read and write operations on the address space; each corresponds either to a local memory operation or to communication over an interconnection network. Remote operations incur high latencies, but much of that latency can be overlapped with local computation or with the initiation of further remote operations. Non-blocking memory operations allow this overlap to be expressed directly. However, such overlap is difficult for programmers to manage by hand; it can lead to subtle program errors, since the order in which operations complete is no longer obvious. Programmers writing explicitly parallel code expect reads and writes from a single thread to take effect in program order, a property called sequential consistency. Naive use of non-blocking memory operations can yield executions that violate sequential consistency. We give a new static program analysis that detects which memory operations can safely be made non-blocking. The analysis requires dependence information both within and across threads, and builds on earlier work by Shasha and Snir. We improve on their results by giving a more efficient algorithm for SPMD programs and by using synchronization information to sharpen the accuracy of the analysis. Using the results of this analysis, we show how to optimize parallel programs by converting blocking operations into non-blocking ones, performing code motion to lengthen the window available for communication overlap, and caching remote values to eliminate some read accesses entirely. We demonstrate the potential payoff of each optimization on real applications, using hand-transformed programs. The experiments were run on a CM-5 multiprocessor using the Split-C runtime system, which provides a software implementation of a global address space with both blocking and non-blocking memory operations.
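To make the transformations concrete, the sketch below shows the blocking-to-non-blocking conversion and the accompanying code motion in plain C. The split-phase primitives nb_get() and nb_sync() are hypothetical names introduced for illustration (in Split-C itself, initiation and completion would be expressed with the split-phase assignment and sync()); the single-node definitions here exist only so the sketch compiles, and a minimal sketch is all this is, not the paper's implementation.

    #include <stddef.h>

    /* Hypothetical split-phase primitives: nb_get() initiates a read of
     * *src into *dst and returns immediately; nb_sync() blocks until all
     * outstanding nb_get() operations have completed. These single-node
     * stand-ins complete the read eagerly; a real implementation would
     * issue a network request and track its completion. */
    static void nb_get(double *dst, const double *src) { *dst = *src; }
    static void nb_sync(void) { /* nothing outstanding in the stand-in */ }

    /* Blocking version: every access to remote[i] stalls the processor
     * for the full round-trip latency of the interconnect. */
    double sum_blocking(const double *remote, const double *local, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += remote[i] * local[i];
        return s;
    }

    /* Transformed version: the read of element i+1 is initiated before the
     * local work on element i (code motion), so communication for the next
     * iteration overlaps computation in the current one. */
    double sum_overlapped(const double *remote, const double *local, size_t n)
    {
        double cur, next, s = 0.0;
        if (n == 0)
            return 0.0;
        nb_get(&cur, &remote[0]);
        nb_sync();
        for (size_t i = 0; i < n; i++) {
            int have_next = (i + 1 < n);
            if (have_next)
                nb_get(&next, &remote[i + 1]);  /* start next fetch early */
            s += cur * local[i];                /* overlapped local work */
            if (have_next) {
                nb_sync();                      /* wait for prefetched value */
                cur = next;
            }
        }
        return s;
    }

Such a transformation is safe only when reordering the initiation of the read past the intervening local operations cannot be observed by another thread; that is precisely the condition the static analysis described above is designed to check.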

[1]  Keshav Pingali,et al.  I-structures: Data structures for parallel computing , 1986, Graph Reduction.

[2]  David A. Padua,et al.  Issues in the Optimization of Parallel Programs , 1990, ICPP.

[3]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 1979, IEEE Transactions on Computers.

[4]  Ron Cytron,et al.  An Overview of the PTRAN Analysis System for Multiprocessing , 1988, J. Parallel Distributed Comput..

[5]  Andrea C. Arpaci-Dusseau,et al.  Parallel programming in Split-C , 1993, Supercomputing '93. Proceedings.

[6]  David S. Johnson,et al.  Computers and Intractability: A Guide to the Theory of NP-Completeness , 1979, W. H. Freeman, San Francisco.

[7]  Katherine Yelick,et al.  Data Structures for Irregular Applications , 1993 .

[8]  Seth Copen Goldstein,et al.  Active Messages: A Mechanism for Integrated Communication and Computation , 1992, [1992] Proceedings the 19th Annual International Symposium on Computer Architecture.

[9]  Samuel P. Midkiff,et al.  Compiling programs with user parallelism , 1990.

[10]  Anoop Gupta,et al.  The directory-based cache coherence protocol for the DASH multiprocessor , 1990, ISCA '90.

[11]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[12]  Jaspal Subhlok,et al.  Static analysis of low-level synchronization , 1988, PADD '88.

[13]  Dennis Shasha,et al.  Efficient and correct execution of parallel programs that share memory , 1988, ACM Trans. Program. Lang. Syst..

[14]  Ken Kennedy,et al.  Compiler optimizations for Fortran D on MIMD distributed-memory machines , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[15]  James Demmel,et al.  LAPACK: a portable linear algebra library for supercomputers , 1989, IEEE Control Systems Society Workshop on Computer-Aided Control System Design.

[16]  Dirk Grunwald,et al.  Data flow equations for explicitly parallel programs , 1993, PPOPP '93.