Optimizing parallel programs with dynamic data structures

Distributed-memory parallel architectures support a memory model in which some memory accesses are local, and thus inexpensive, while others are remote and potentially quite expensive. To achieve efficiency on such architectures, we need to reduce the number and cost of remote accesses. This is particularly challenging for applications that use dynamic data structures. In this thesis, I present two groups of compiler techniques that reduce the overhead of remote memory accesses in applications based on dynamic data structures: locality techniques and communication optimizations. Locality techniques include a static locality analysis, which estimates at compile time when an indirect reference through a pointer can safely be assumed to be a local access, and dynamic locality checks, which consist of runtime tests that identify local accesses. Communication optimizations include: (1) code movement to issue remote reads earlier and remote writes later; (2) code transformations to replace repeated or redundant remote accesses with a single access; and (3) transformations to block or pipeline a group of remote requests together. Both the locality and the communication techniques have been implemented in our EARTH-McCAT compiler framework, and a series of experiments has been conducted to evaluate them. The experimental results show a performance improvement of up to 26% with each technique applied alone, and of up to 29% when both techniques are applied together.
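
As a rough illustration of two of the ideas above, the following C sketch contrasts a naive access pattern with one that uses a dynamic locality check and a blocked remote read. It is a single-process simulation under stated assumptions, not the EARTH-McCAT implementation: the names `GPtr`, `my_node`, `remote_ops`, `remote_read_int`, `remote_block_read`, `score_naive`, and `score_optimized` are all hypothetical, and a "remote" access is modeled simply as a counter increment.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical simulation of a distributed-memory model.
 * None of these names come from the EARTH-McCAT runtime. */

typedef struct Node {
    int value;
    int weight;
} Node;

/* A "global pointer": the owning processor id travels with the address,
 * so locality can be tested without touching the remote data itself. */
typedef struct GPtr {
    int owner;
    Node *addr;
} GPtr;

static int my_node = 0;     /* id of the processor running this code */
static int remote_ops = 0;  /* number of simulated remote transfers */

/* Simulated remote read of a single int field: one transfer per call. */
static int remote_read_int(const int *addr)
{
    remote_ops++;
    return *addr;
}

/* Simulated blocked read: one transfer fetches the whole struct. */
static void remote_block_read(Node *dst, const Node *src)
{
    remote_ops++;
    memcpy(dst, src, sizeof *dst);
}

/* Naive version: every field access is treated as remote,
 * and the field `value` is fetched twice. */
static int score_naive(GPtr p)
{
    int v1 = remote_read_int(&p.addr->value);
    int w  = remote_read_int(&p.addr->weight);
    int v2 = remote_read_int(&p.addr->value);  /* redundant remote read */
    return v1 * w + v2;
}

/* Optimized version: a dynamic locality check selects a purely local
 * path when possible; otherwise the field reads are blocked into one
 * transfer and the redundant read of `value` is served locally. */
static int score_optimized(GPtr p)
{
    if (p.owner == my_node) {
        return p.addr->value * p.addr->weight + p.addr->value;
    }
    Node copy;
    remote_block_read(&copy, p.addr);
    return copy.value * copy.weight + copy.value;
}
```

In this simulation, calling both functions on one local and one remote node costs six simulated transfers for the naive version and one for the optimized version. Where a static locality analysis can prove that `p` is always local, a compiler could in principle drop the runtime check and keep only the local path.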
