Interprocedural Load Elimination for Dynamic Optimization of Parallel Programs

Load elimination is a classical compiler transformation that is increasing in importance for multi-core and many-core architectures. The effect of the transformation is to replace a memory access, such as a read of an object field or an array element, by a read of a compiler-generated temporary that can be allocated in faster and more energy-efficient storage structures such as registers and local memories (scratchpads). Unfortunately, current just-in-time and dynamic compilers perform load elimination only in limited situations. In particular, they usually make worst-case assumptions about potential side effects arising from parallel constructs and method calls. These two constraints interact with each other since parallel constructs are usually translated to low-level runtime library calls. In this paper, we introduce an interprocedural load elimination algorithm suitable for use in dynamic optimization of parallel programs. The main contributions of the paper include: a) an algorithm for load elimination in the presence of three core parallel constructs -- async, finish, and isolated, b) efficient side-effect analysis for method calls, c) extended side-effect analysis for parallel constructs using an Isolation Consistency memory model, and d) performance results to study the impact of load elimination on a set of standard benchmarks using an implementation of the algorithm in Jikes RVM for optimizing programs written in a subset of the X10 v1.5 language. Our performance results show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core, and 1.39x on 16 cores, when comparing the algorithm in this paper with the load elimination algorithm available in Jikes RVM.

[1]  Thomas R. Gross,et al.  Load Elimination in the Presence of Side Effects, Concurrency and Precise Exceptions , 2003, LCPC.

[2]  Dennis Shasha,et al.  Efficient and correct execution of parallel programs that share memory , 1988, TOPL.

[3]  Thomas C. Spillman,et al.  Exposing Side-Effects in a PL/I Optimizing Compiler , 1971, IFIP Congress.

[4]  Vivek Sarkar,et al.  Phasers: a unified deadlock-free construct for collective and point-to-point synchronization , 2008, ICS '08.

[5]  Ondrej Lhoták,et al.  Using Inter-Procedural Side-Effect Information in JIT Optimizations , 2005, CC.

[6]  Raymond Lo,et al.  Register promotion by sparse partial redundancy elimination of loads and stores , 1998, PLDI.

[7]  Milo M. K. Martin,et al.  Subtleties of transactional memory atomicity semantics , 2006, IEEE Computer Architecture Letters.

[8]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.

[9]  Ken Kennedy,et al.  Fast interprocedual alias analysis , 1989, POPL '89.

[10]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[11]  Ken Kennedy,et al.  Scalar replacement in the presence of conditional control flow , 1994, Softw. Pract. Exp..

[12]  Sarita V. Adve,et al.  Recent advances in memory consistency models for hardware shared memory systems , 1999, Proc. IEEE.

[13]  James R. Larus,et al.  Transactional Memory , 2006, Transactional Memory.

[14]  Ken Kennedy,et al.  Improving register allocation for subscripted variables , 1990, SIGP.

[15]  Lars Ræder Clausen A Java Bytecode Optimizer Using Side-Effect Analysis , 1997, Concurr. Pract. Exp..

[16]  Experiences with an SMP Implementation for X 10 based on the Java Concurrency Utilities ( Extended Abstract ) , 2008 .

[17]  Vivek Sarkar,et al.  Unified Analysis of Array and Object References in Strongly Typed Languages , 2000, SAS.

[18]  Barbara G. Ryder,et al.  Interprocedural modification side effect analysis with pointer aliasing , 1993, PLDI '93.

[19]  Rajiv Gupta,et al.  Load-reuse analysis: design and evaluation , 1999, PLDI '99.

[20]  Vivek Sarkar,et al.  Location Consistency-A New Memory Model and Cache Consistency Protocol , 2000, IEEE Trans. Computers.

[21]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[22]  Thomas R. Gross,et al.  Static conflict analysis for multi-threaded object-oriented programs , 2003, PLDI '03.

[23]  Bronis R. de Supinski,et al.  The OpenMP Memory Model , 2005, IWOMP.

[24]  Frances E. Allen,et al.  Interprocedural Data Flow Analysis , 1974, IFIP Congress.

[25]  Vivek Sarkar,et al.  May-happen-in-parallel analysis of X10 programs , 2007, PPoPP.

[26]  Jeremy Manson,et al.  The Java memory model , 2005, POPL '05.

[27]  John Banning,et al.  : An Efficient , 2022 .