Paper Information - Increasing the Applicability of Scalar Replacement

Increasing the Applicability of Scalar Replacement

This paper describes an algorithm for scalar replacement, which replaces repeated accesses to an array element with a scalar temporary. The element is accessed from a register rather than memory, thereby eliminating unnecessary memory accesses. A previous approach to this problem combines scalar replacement with a loop transformation called unroll-and-jam, whereby outer loops in a nest are unrolled, and the resulting duplicate inner loop bodies are fused together. The effect of unroll-and-jam is to bring opportunities for scalar replacement into inner loop bodies. In this paper, we describe an alternative approach that can exploit reuse opportunities across multiple loops in a nest without requiring unroll-and-jam. We also use this technique to eliminate unnecessary writes back to memory. The approach described in this paper is particularly well-suited to architectures with large register files and efficient mechanisms for register-to-register transfer. From our experimental results mapping 5 multimedia kernels to an FPGA platform, assuming 32 registers, we observe a 58 to 90 percent reduction in memory accesses and speedups of 2.34 to 7.31 over the original programs.
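As a rough illustration of the transformation the abstract describes, the following C sketch shows two loops before and after a hand-applied scalar replacement. It is not the paper's algorithm or its benchmark code: the function names (`row_sums_*`, `smooth3_*`), the kernels, and the assumption that the input and output arrays do not overlap are all hypothetical choices made for this example. The second pair also hints at why cheap register-to-register transfer matters, since values are rotated between scalars instead of being reloaded.

```c
#include <stddef.h>

/* Hypothetical kernel 1, original form: a[i] is read and written in
   memory on every iteration of the inner j loop. */
void row_sums_original(size_t n, size_t m, const int *b, int *a) {
    for (size_t i = 0; i < n; i++) {
        a[i] = 0;
        for (size_t j = 0; j < m; j++) {
            a[i] = a[i] + b[i * m + j];
        }
    }
}

/* Kernel 1 after scalar replacement: the reused element is kept in the
   scalar t (a register candidate) and written back once per i.
   Assumes a and b do not overlap. */
void row_sums_scalar_replaced(size_t n, size_t m, const int *b, int *a) {
    for (size_t i = 0; i < n; i++) {
        int t = 0;
        for (size_t j = 0; j < m; j++) {
            t += b[i * m + j];
        }
        a[i] = t; /* single write-back replaces m redundant stores */
    }
}

/* Hypothetical kernel 2, original form: each iteration loads three
   elements of a, two of which were already loaded by the previous one. */
void smooth3_original(size_t n, const int *a, int *c) {
    for (size_t i = 1; i + 1 < n; i++) {
        c[i] = a[i - 1] + a[i] + a[i + 1];
    }
}

/* Kernel 2 after scalar replacement: values reused across iterations
   stay in r0 and r1 and are shifted register to register, so only one
   new element is loaded per iteration. Assumes a and c do not overlap. */
void smooth3_scalar_replaced(size_t n, const int *a, int *c) {
    if (n < 3) {
        return; /* the original loop body never runs for n < 3 */
    }
    int r0 = a[0];
    int r1 = a[1];
    for (size_t i = 1; i + 1 < n; i++) {
        int r2 = a[i + 1]; /* the only load from a in this iteration */
        c[i] = r0 + r1 + r2;
        r0 = r1; /* rotate the reused values */
        r1 = r2;
    }
}
```

In both pairs the replaced version computes the same results as the original while touching memory less often; how much this helps in practice depends on the number of available registers, which the abstract fixes at 32 for its experiments.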

Mary W. Hall | Byoungro So
