Communication-Minimal Tiling of Uniform Dependence Loops

Tiling is a loop transformation that compilers use to generate blocked algorithms automatically, in order to better exploit the memory hierarchy and to reduce the communication overhead between processors. Motivated by existing results, this paper presents a conceptually simple approach to finding tilings with a minimal amount of communication between tiles. The development of all results is based primarily on the inequality of arithmetic and geometric means, except for Lemma 8, whose proof relies on the concept of extremal rays of convex cones. The key insight is that a communication-minimal tiling must induce the same amount of communication through all faces of a tile, which restricts the search space for optimal tilings to those tiling matrices whose rows are all extremal rays in a cone. For nested loops with several special forms of dependences, closed-form optimal tilings are derived. In the general case, a procedure is given that always returns optimal tilings. A detailed comparison of this work with some existing results is provided.
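The role of the arithmetic-geometric mean inequality can be illustrated with a deliberately simplified sketch (not the paper's general procedure): for a doubly nested loop with the unit dependences (1,0) and (0,1) and rectangular tiles of sizes t1 × t2, the per-tile communication is proportional to t1 + t2, and AM-GM gives t1 + t2 ≥ 2√(t1·t2), with equality exactly when both faces carry the same amount of communication. The cost model `comm_per_tile` below is an assumption made for this toy setting.

```python
import math

def comm_per_tile(t1, t2):
    # Toy cost model (an assumption for this sketch): each unit
    # dependence (1,0) and (0,1) crosses one tile face, so the
    # communication volume per tile is proportional to t1 + t2.
    return t1 + t2

V = 64  # fixed tile volume: iterations per tile is held constant

# Enumerate all integer rectangular tile shapes of volume V and pick
# the one with the least communication.
shapes = [(t1, V // t1) for t1 in range(1, V + 1) if V % t1 == 0]
best = min(shapes, key=lambda s: comm_per_tile(*s))

# By AM-GM, t1 + t2 >= 2*sqrt(t1*t2), with equality iff t1 == t2:
# the square tile equalizes communication through both faces.
assert best == (int(math.sqrt(V)), int(math.sqrt(V)))
```

This mirrors the paper's key insight in the simplest case: among tiles of equal volume, the communication-minimal one spreads communication equally across its faces.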
