Eliminating conflict misses for high performance architectures

Many cache misses in scientific programs are due to conflicts caused by limited set associativity. Two data-layout transformations, interand intra-variable padding, can eliminate many confict misses at compile time. We present GROUPPAD, an inter-variable padding heuristic to preserve group reuse in stencil computations frequently found in scientific computations. We show padding can also improve performance in parallel programs. Our optimizations have been implemented and tested on a collection of kernels and programs for different cache and data sizes. Preliminary results demonstrate GROUPPAD is able to consistently preserve group reuse among the programs evaluated, though execution time improvements are small for actual problem and cache sizes tested. Padding improves performance of parallel versions of programs approximately the same magnitude as sequential versions of the same program.

[1]  Vivek Sarkar,et al.  On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.

[2]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[3]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[4]  Chau-Wen Tseng,et al.  Enhancing software DSM for compiler-parallelized applications , 1997, Proceedings 11th International Parallel Processing Symposium.

[5]  Wei Li,et al.  Unifying data and control transformations for distributed shared-memory machines , 1995, PLDI '95.

[6]  Olivier Temam,et al.  A quantitative analysis of loop nest locality , 1996, ASPLOS VII.

[7]  Michael F. P. O'Boyle,et al.  Non-singular data transformations: definition, validity and applications , 1997, ICS '97.

[8]  Brian N. Bershad,et al.  Avoiding conflict misses dynamically in large direct-mapped caches , 1994, ASPLOS VI.

[9]  Olivier Temam,et al.  Cache interference phenomena , 1994, SIGMETRICS.

[10]  Tarek S. Abdelrahman,et al.  Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..

[11]  Mateo Valero,et al.  Eliminating cache conflict misses through XOR-based placement functions , 1997, ICS '97.

[12]  Monica S. Lam,et al.  Data and computation transformations for multiprocessors , 1995, PPOPP '95.

[13]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[14]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[15]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[16]  Susan J. Eggers,et al.  Reducing false sharing on shared memory multiprocessors through compile time data transformations , 1995, PPOPP '95.

[17]  Mahmut T. Kandemir,et al.  A compiler algorithm for optimizing locality in loop nests , 1997, ICS '97.

[18]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[19]  Todd C. Mowry,et al.  Compiler-directed page coloring for multiprocessors , 1996, ASPLOS VII.

[20]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[21]  Steven W. K. Tjiang,et al.  SUIF: an infrastructure for research on parallelizing and optimizing compilers , 1994, SIGP.

[22]  David A. Wood,et al.  Cache profiling and the SPEC benchmarks: a case study , 1994, Computer.