Optimized Unrolling of Nested Loops

Loop unrolling is a well-known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address two problems: automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, our contributions include (i) a more detailed cost model that accounts for register locality, instruction-level parallelism, and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have more registers and a higher degree of instruction-level parallelism than the processor used for our measurements (PowerPC 604).
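To make the transformation concrete, the following is a minimal sketch of classical unroll-and-jam applied to matrix multiply, the baseline that the abstract compares against. It is not the code generation algorithm of this paper. The unroll vector (2, 2, 1) is fixed by hand rather than chosen by a cost model, and the function names `matmul_baseline` and `matmul_unroll_jam_2x2` are hypothetical, introduced only for illustration.

```python
def matmul_baseline(a, b, n):
    """Reference triple loop: c = a * b on the leading n x n blocks."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_unroll_jam_2x2(a, b, n):
    """Unroll the i and j loops by 2 and jam the copies into one k loop.

    Assumption: unroll factors are hard-coded to 2; a compiler would
    select them from a cost model.
    """
    c = [[0.0] * n for _ in range(n)]
    n_even = n - (n % 2)  # largest multiple of 2 not exceeding n

    for i in range(0, n_even, 2):
        for j in range(0, n_even, 2):
            # Four accumulators stand in for registers; each loaded
            # a and b value is reused twice inside the jammed body.
            c00 = c01 = c10 = c11 = 0.0
            for k in range(n):
                a0 = a[i][k]
                a1 = a[i + 1][k]
                b0 = b[k][j]
                b1 = b[k][j + 1]
                c00 += a0 * b0
                c01 += a0 * b1
                c10 += a1 * b0
                c11 += a1 * b1
            c[i][j] = c00
            c[i][j + 1] = c01
            c[i + 1][j] = c10
            c[i + 1][j + 1] = c11

    # Remainder: the leftover column (if n is odd) for the unrolled rows.
    for i in range(n_even):
        for j in range(n_even, n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]

    # Remainder: the leftover row (if n is odd), all columns.
    for i in range(n_even, n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]

    return c
```

The jammed body performs four multiply-adds per four loads instead of one multiply-add per two loads, which illustrates the register-locality benefit the cost model has to weigh. The two remainder loop nests show the kind of code growth that motivates generating more compact code.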
