The combined effectiveness of unimodular transformations, tiling, and software prefetching

Unimodular transformations, tiling, and software prefetching are loop optimizations known to be effective in increasing parallelism, reducing cache miss rates, and eliminating processor stall time. Although these optimizations individually are quite effective, there is the expectation that even better improvements can be obtained by combining them together. In this paper we show that indeed this is the case when unimodular transformations are combined with either tiling or software prefetching. However, our results also show that although combining tiling with prefetching tends to improve the performance of tiling alone, it is also the case that in some situations tiling can degrade the cache performance of software prefetching. The reasons for this unexpected behavior are three fold: 1) tiling introduces interference misses inside the localized space which are difficult to characterize with current techniques based on locality analysis; 2) prefetch predicates are computed using only estimates on the amount of capacity misses, so the latency induced by cache interference is not completely covered; and 3) tiling limits the maximum amount of latency that can be masked with prefetching.

[1]  King-Sun Fu,et al.  Data Coherence Problem in a Multicache System , 1985, IEEE Transactions on Computers.

[2]  M. Hill,et al.  Weak ordering-a new definition , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[3]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[4]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[5]  Jack Dongarra,et al.  LAPACK: a portable linear algebra library for high-performance computers , 1990, SC.

[6]  Anoop Gupta,et al.  Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..

[7]  Monica S. Lam,et al.  A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst..

[8]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[9]  B J Smith,et al.  A pipelined, shared resource MIMD computer , 1986 .

[10]  Ken Kennedy,et al.  Improving register allocation for subscripted variables , 1990, PLDI '90.

[11]  Michael Wolfe,et al.  Iteration Space Tiling for Memory Hierarchies , 1987, PPSC.

[12]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[13]  Michel Dubois,et al.  Concurrent Miss Resolution in Multiprocessor Caches , 1988, ICPP.

[14]  Calvin K. Tang Cache system design in the tightly coupled multiprocessor system , 1976, AFIPS '76.

[15]  Utpal Banerjee,et al.  Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.

[16]  Pen-Chung Yew,et al.  : Data Prefetching In Shared Memory Multiprocessors , 1987, ICPP.

[17]  Robert H. Halstead,et al.  MASA: a multithreaded processor architecture for parallel symbolic computing , 1988, [1988] The 15th Annual International Symposium on Computer Architecture. Conference Proceedings.

[18]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[19]  Anant Agarwal,et al.  APRIL: a processor architecture for multiprocessing , 1990, ISCA '90.

[20]  François Irigoin,et al.  Supernode partitioning , 1988, POPL '88.

[21]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[22]  W. K. George,et al.  University of Illinois at Urbana-Champain , 1997 .

[23]  Walid Abu-Sufah,et al.  Improving the performance of virtual memory computers. , 1979 .

[24]  Ken Kennedy,et al.  Software methods for improvement of cache performance on supercomputer applications , 1989 .

[25]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[26]  Olivier Temam,et al.  To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93. Proceedings.

[27]  Ken Kennedy,et al.  Automatic translation of FORTRAN programs to vector form , 1987, TOPL.

[28]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[29]  .May ..id University of Illinois , 1919, The Grants Register 2021.

[30]  Jack J. Dongarra,et al.  Solving linear systems on vector and shared memory computers , 1990 .

[31]  Michel Dubois,et al.  Synchronization, coherence, and event ordering in multiprocessors , 1988, Computer.

[32]  John Randal Allen,et al.  Dependence analysis for subscripted variables and its application to program transformations , 1983 .