Increasing the Applicability of Scalar Replacement

This paper describes an algorithm for scalar replacement, which replaces repeated accesses to an array element with a scalar temporary. The element is then accessed from a register rather than from memory, eliminating unnecessary memory accesses. A previous approach to this problem combines scalar replacement with a loop transformation called unroll-and-jam, in which outer loops in a nest are unrolled and the resulting duplicate inner loop bodies are fused together. The effect of unroll-and-jam is to bring opportunities for scalar replacement into inner loop bodies. In this paper, we describe an alternative approach that exploits reuse opportunities across multiple loops in a nest without requiring unroll-and-jam. We also use this technique to eliminate unnecessary writes back to memory. The approach is particularly well suited to architectures with large register files and efficient mechanisms for register-to-register transfer. In experiments mapping five multimedia kernels to an FPGA platform, assuming 32 registers, we observe a 58 to 90 percent reduction in memory accesses and speedups of 2.34 to 7.31 over the original programs.
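To make the basic transformation concrete, the following is a minimal hand-written sketch in C of what scalar replacement does to a small FIR-style loop nest. It is not the algorithm from this paper; the function names, the `TAPS` constant, and the choice of kernel are assumptions made for illustration. The sketch also assumes that `y` does not overlap `c` or `x`, and that `x` holds at least `n + TAPS - 1` elements.

```c
#include <stddef.h>

/* Hypothetical tap count, kept small so every coefficient fits in a register. */
#define TAPS 4

/* Baseline: each inner iteration reads y[j] and c[i] from memory
 * and writes y[j] back to memory. */
void fir_baseline(const int *c, const int *x, int *y, size_t n)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < TAPS; i++)
            y[j] = y[j] + c[i] * x[i + j];
}

/* Scalar-replaced by hand (assumes y does not alias c or x):
 *  - c[0..3] are reused across the outer loop, so they are loaded once
 *    into scalars before the nest;
 *  - y[j] is reused across the inner loop, so it is accumulated in a
 *    scalar and written back once per outer iteration. */
void fir_scalar_replaced(const int *c, const int *x, int *y, size_t n)
{
    int c0 = c[0], c1 = c[1], c2 = c[2], c3 = c[3];

    for (size_t j = 0; j < n; j++) {
        int acc = y[j];          /* single read of y[j] */
        acc += c0 * x[j];
        acc += c1 * x[j + 1];
        acc += c2 * x[j + 2];
        acc += c3 * x[j + 3];
        y[j] = acc;              /* single write-back of y[j] */
    }
}
```

In this sketch the coefficient loads are hoisted out of the whole nest and the redundant stores to `y[j]` collapse into one per outer iteration, which mirrors the two effects the abstract names: reuse across more than one loop in the nest, and elimination of unnecessary writes back to memory. The reads of `x` are left untouched here; reusing them across outer iterations would need register-to-register shifting of the kind the abstract alludes to.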
