Communication optimizations for parallel C programs

This paper presents algorithms for reducing the communication overhead of parallel C programs that use dynamically allocated data structures. The framework consists of an analysis phase called possible-placement analysis and a transformation phase called communication selection. The fundamental idea of possible-placement analysis is to find all possible points for inserting remote memory operations: remote reads are propagated upwards, whereas remote writes are propagated downwards. Based on the results of the possible-placement analysis, the communication selection transformation selects the "best" place to insert each communication and determines whether pipelining or blocking of communication should be performed. The framework has been implemented in the EARTH-McCAT optimizing/parallelizing C compiler, and experimental results are presented for five pointer-intensive benchmarks running on the EARTH-MANNA distributed-memory parallel architecture. These experiments show that the communication optimization can provide performance improvements of up to 16% over the unoptimized benchmarks.
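
As a rough illustration of the effect the framework aims for, the C sketch below shows a loop before and after the kind of transformation the abstract describes. The primitives remote_read(), remote_write_async(), and sync_writes() are hypothetical stand-ins for split-phase remote memory operations; they are not the EARTH-McCAT interface, and the hand-written "optimized" version only sketches the intuition behind propagating reads upwards and writes downwards.

```c
/*
 * Illustration only: remote_read(), remote_write_async() and sync_writes()
 * are hypothetical stand-ins for split-phase remote operations, NOT the
 * EARTH-McCAT API described in the paper.
 */
#include <stddef.h>

extern void remote_read(int *local, const int *remote);   /* blocking remote read     */
extern void remote_write_async(int *remote, int value);   /* split-phase remote write */
extern void sync_writes(void);                            /* wait for pending writes  */

/* Before optimization: remote operations buried inside the loop. */
void scale_unoptimized(int *remote_factor, int *remote_out, const int *in, int n)
{
    for (int i = 0; i < n; i++) {
        int factor;
        remote_read(&factor, remote_factor);                 /* re-read every iteration */
        remote_write_async(&remote_out[i], in[i] * factor);
        sync_writes();                                       /* wait after every write  */
    }
}

/* After optimization (conceptually): the loop-invariant remote read is
 * hoisted upward to the earliest safe point, the remote writes are
 * pipelined, and a single synchronization is sunk downward past the loop. */
void scale_optimized(int *remote_factor, int *remote_out, const int *in, int n)
{
    int factor;
    remote_read(&factor, remote_factor);                     /* hoisted: one read total */
    for (int i = 0; i < n; i++) {
        remote_write_async(&remote_out[i], in[i] * factor);  /* pipelined writes        */
    }
    sync_writes();                                           /* sunk: one wait for all  */
}
```

In this sketch, issuing the read at the earliest safe point and delaying the single synchronization to the latest safe point lets the write latency overlap the loop body, which is the kind of opportunity that possible-placement analysis is intended to expose and communication selection to exploit.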
