Instruction combining for coalescing memory accesses using global code motion

Instruction combining is an optimization that replaces a sequence of instructions with a more efficient instruction that yields the same result in fewer machine cycles. When it is used to coalesce memory accesses, it reduces memory traffic by combining narrow memory references to contiguous addresses into a single wider reference, taking advantage of a wide-bus architecture. Coalescing memory accesses can improve performance in two ways: it reduces the additional cycles required to move data from caches to registers, and it reduces the stall cycles caused by multiple outstanding memory access requests. Previous approaches to memory access coalescing focus only on array access instructions related to loop induction variables, and thus miss many other opportunities. In this paper, we propose a new algorithm for instruction combining that applies global code motion to wider regions of the given program in search of more potential candidates. Using our algorithm, we implemented two optimizations for coalescing memory accesses in the IBM Java™ JIT compiler for IA-64, one combining two 32-bit integer loads and the other combining two single-precision floating-point loads, and evaluated them on the SPECjvm98 benchmark suite. In our experiments, performance improved by up to 5.5% with little additional compilation-time overhead. Moreover, when every instance variable declared as double is replaced with float, performance improves by 7.3% for the MolDyn benchmark in the JavaGrande benchmark suite. Our approach can be applied to a variety of architectures and to programming languages other than Java.
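To make the idea of combining two 32-bit integer loads concrete, the following is a minimal source-level sketch, not the paper's algorithm or the JIT's actual output. It assumes a little-endian layout and uses a `ByteBuffer` as a stand-in for memory; the class and method names (`CoalesceSketch`, `loadPairNarrow`, `loadPairCoalesced`) are hypothetical. The narrow version issues two 32-bit memory references to contiguous addresses, while the coalesced version issues one 64-bit reference and recovers both values with register-only operations.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration; names and layout are assumptions, not from the paper.
final class CoalesceSketch {

    // Before: two narrow (32-bit) loads from contiguous addresses.
    static int[] loadPairNarrow(ByteBuffer mem, int offset) {
        int a = mem.getInt(offset);      // first 32-bit memory reference
        int b = mem.getInt(offset + 4);  // second 32-bit memory reference
        return new int[] { a, b };
    }

    // After: one wide (64-bit) load, then extraction without touching memory.
    // Assumes the buffer is little-endian, so the value at the lower address
    // occupies the low 32 bits of the wide value.
    static int[] loadPairCoalesced(ByteBuffer mem, int offset) {
        long wide = mem.getLong(offset); // single 64-bit memory reference
        int a = (int) wide;              // low half: value at offset
        int b = (int) (wide >>> 32);     // high half: value at offset + 4
        return new int[] { a, b };
    }

    public static void main(String[] args) {
        ByteBuffer mem = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        mem.putInt(0, 7);
        mem.putInt(4, -3);

        int[] narrow = loadPairNarrow(mem, 0);
        int[] wide = loadPairCoalesced(mem, 0);
        System.out.println(narrow[0] == wide[0] && narrow[1] == wide[1]);
    }
}
```

In a real compiler this transformation is applied to the intermediate representation or machine code, and it is legal only when both loads are known to be adjacent, suitably aligned, and free of intervening stores to the same locations; the sketch ignores those checks.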
