Unroll-and-jam using uniformly generated sets

Modern architectural trends toward instruction-level parallelism (ILP) are significantly increasing the computational power of microprocessors. As a result, the demands on memory have increased. Unfortunately, memory systems have not kept pace. Even hierarchical cache structures are ineffective if programs do not exhibit cache locality. Compilers therefore need to be concerned not only with finding ILP to utilize machine resources effectively, but also with ensuring that the resulting code has a high degree of cache locality. One compiler transformation that is essential for meeting both objectives is unroll-and-jam, or outer-loop unrolling. Previous work has either used a dependence-based model to compute unroll amounts, which significantly increases the size of the dependence graph, or applied a more brute-force technique. In this paper, we present an algorithm that uses a linear-algebra-based technique to compute unroll amounts. Across our benchmark suite, this technique needs 84% fewer dependences in total than dependence-based techniques. Additionally, it gives a more elegant solution with no loss in optimization performance relative to previous techniques.
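To make the transformation concrete, here is a minimal sketch of unroll-and-jam applied by hand to a matrix-vector product. The example is not taken from the paper: the function names, the kernel, and the fixed unroll amount of 2 are assumptions chosen for illustration, whereas the paper's algorithm computes the unroll amount. The outer loop is unrolled by 2 and the two copies of the inner loop are fused ("jammed"), so each x[j] is read once and used by two statements.

```python
def matvec_original(a, x):
    """Reference loop nest: y[i] += a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_unroll_and_jam(a, x):
    """Same computation with the outer i-loop unrolled by 2 and the
    resulting inner-loop copies jammed into one loop body.
    The unroll amount 2 is an assumption made for this sketch."""
    n, m = len(a), len(x)
    y = [0.0] * n
    main = n - n % 2  # largest multiple of 2 not exceeding n

    for i in range(0, main, 2):
        for j in range(m):
            xj = x[j]  # read once, reused by both unrolled statements
            y[i] += a[i][j] * xj
            y[i + 1] += a[i + 1][j] * xj

    # Cleanup loop for the leftover row when n is odd.
    for i in range(main, n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y
```

In this kernel the outer iterations are independent of one another, so jamming them is safe; in general the transformation is only legal when it does not reverse a dependence. The reuse of x[j] across the two jammed statements is the kind of reuse that a choice of unroll amount aims to exploit.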
