Quantifying the Multi-level Nature of Tiling Interactions

Optimizations such as tiling often target a single level of memory or parallelism, for example the cache. They are usually applied level by level, guided by a cost function parameterized by features of that single level. The benefit of optimizations guided by such one-level cost functions decreases as architectures move toward hierarchies of memory and of parallelism. We identify three common architectural scenarios in which a single tiling choice can be improved by using information from multiple levels in concert. For the first two scenarios, we derive multi-level cost functions that guide the choice of optimal tile size and shape, and we quantify the improvement gained, supporting our claims with both analysis and simulation results. For the third scenario, we summarize our findings.
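For readers unfamiliar with the transformation, the following is a minimal sketch of what a single tiling choice looks like, using matrix multiplication as an assumed example kernel. The function names and the single square `tile` parameter are illustrative and not taken from the paper; the paper's cost functions, which guide the choice of tile size and shape, are not reproduced here.

```python
def matmul_naive(a, b):
    """Untiled reference: C = A * B for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile):
    """Same computation, with the iteration space cut into tile^3 blocks.

    `tile` is the quantity a cost function would choose; here it is
    simply a caller-supplied parameter (an assumption for illustration).
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops step from tile to tile.
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # Inner loops sweep one tile, clipped at the array edges.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = matmul_tiled(a, b, tile=2)
```

Tiling only reorders the iterations, so `matmul_tiled` returns the same result as `matmul_naive` for any positive `tile`. A one-level approach would pick `tile` from the parameters of a single level, such as the cache size; a multi-level approach would take several levels into account together.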
