Improving locality and parallelism in nested loops

Researchers have identified a core set of program transformations that are effective for optimizing array-based loop nests: loop interchange, skewing, reversal, and tiling. These transformations have been studied individually for their legality and their effect on parallelism and memory hierarchy performance, but prior work does not discuss in detail how to choose the combination of transformations that best optimizes a given loop nest. Other researchers treat each loop nest as a whole, applying an elegant matrix theory of loop nest transformation; that theory, however, applies only to the limited class of loop nests whose dependences can be expressed as distance vectors. In this restricted setting the problems of memory hierarchy improvement and parallelization are simplified, but the approach has not been extended to general loop nests.

We combine the elegance of the matrix theory with the generality of arbitrary dependence vectors into a new theory of loop transformation. This theory supports an algorithmic approach to the optimization problem: using it, we have developed efficient compiler algorithms that improve the memory hierarchy utilization and parallelism of general loop nests. The parallelization algorithm maximizes the degree of parallelism within a loop nest, at either coarse or fine granularity. The locality algorithm uses the same theory, together with reuse information about the array accesses within a loop nest, to guide the transformation process. The two algorithms are unified so that locality and parallelism can be improved simultaneously without significantly sacrificing either.

We have implemented versions of these algorithms in Stanford's SUIF compiler and evaluated them on the Perfect Club benchmarks and the NASA kernels. We find that compiler-directed locality improvement significantly improves performance when applicable. We also find that the performance of tiled code is extremely sensitive to the tile size on machines with direct-mapped or low-associativity caches.
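To make the tiling transformation discussed above concrete, the following is a minimal C sketch written for this summary; it is not taken from the paper or from SUIF. The function names and the constants N and T are illustrative assumptions, and, as the tile-size sensitivity result suggests, a good T depends on the target machine's cache configuration.

```c
/* A hand-tiled matrix-multiply loop nest: a minimal sketch of the
 * tiling (blocking) transformation, with hypothetical names and sizes. */
#include <stdio.h>

#define N 512
#define T 64   /* hypothetical tile size; tune per cache configuration */

static double a[N][N], b[N][N], c[N][N];

/* Original loop nest: every iteration of i sweeps all of b, so when N
 * is large, b is evicted from the cache before it can be reused. */
static void matmul_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Tiled loop nest: the jj and kk loops restrict the inner loops to a
 * T-by-T working set of b, which stays cache-resident and is reused
 * across all iterations of i before the next tile is touched. */
static void matmul_tiled(void) {
    for (int jj = 0; jj < N; jj += T)
        for (int kk = 0; kk < N; kk += T)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + T && j < N; j++) {
                    double sum = c[i][j];
                    for (int k = kk; k < kk + T && k < N; k++)
                        sum += a[i][k] * b[k][j];
                    c[i][j] = sum;
                }
}

int main(void) {
    matmul_tiled();
    printf("%f\n", c[0][0]);
    return 0;
}
```

On a direct-mapped or low-associativity cache, the rows of a tile can map to the same cache sets and evict one another, so a poorly chosen T can perform worse than a slightly smaller or larger one; this is the tile-size sensitivity reported above.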
