Hierarchical tiling for improved superscalar performance

It takes more than a good algorithm to achieve high performance: inner-loop performance and data locality also matter. Tiling is a well-known technique for parallelization and for improving data locality, but it can be even more broadly beneficial. At the finest granularity, tiling can guide register allocation and instruction scheduling; at the coarsest, it can help manage magnetic storage media. It is also useful for overlapping data movement with computation, for instance by prefetching data from archival storage, disk, and main memory into cache and registers, or by choreographing data movement between processors. Hierarchical tiling is a framework that applies both known tiling methods and new techniques to this expanded set of uses. It eases the burden on several compiler phases that are traditionally treated separately, such as scalar replacement, register allocation, generation of message-passing calls, and storage mapping. By explicitly naming and copying data, it takes control of the mapping of data to memory and of the movement of data between processing elements and up and down the memory hierarchy. This paper focuses on using hierarchical tiling to exploit superscalar pipelined processors. On a simple example, it improves performance by a factor of 3, achieving perfect use of the superscalar processor's pipeline. Hierarchical tiling is presented here as a method for hand-tuning performance; although beyond the scope of this paper, the ideas could be incorporated into an automatic preprocessor or optimizing compiler.
