A case for a working-set-based memory hierarchy

Modern microprocessor designs continue to obtain impressive performance gains through increasing clock rates and advances in the parallelism obtained via micro-architecture design. Unfortunately, corresponding improvements in memory design technology have not been realized, resulting in latencies of over 100 cycles between processors and main memory. This ever-increasing gap in speed has pushed the current memory-hierarchy approach to its limit.Traditional approaches to memory-hierarchy management have not yielded satisfactory results. Hardware solutions require more power and energy than desired and do not scale well. Compiler solutions tend to miss too many optimization opportunities because of limited compile-time knowledge of run-time behavior. This paper explores a different approach that combines both approaches by making use of the static knowledge obtained by the compiler in the dynamic decision making of the micro-architecture. We propose a memory-hierarchy design based on working sets that uses compile-time annotations regarding the working set of memory operations to guide cache placement decisionsOur experiments show that a working-set-based memory hierarchy can significantly reduce the miss rate for memory-intensive tiled kernels by limiting cross interference. The working-set-based memory hierarchy allows the compiler to tile many loops without concern for cross interference in the cache, making tile size choice easier. In addition, the compiler can more easily tailor tile choices to the separate needs of different working sets.

[1]  Jaejin Lee,et al.  Using prime numbers for cache indexing to eliminate conflict misses , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[2]  Chau-Wen Tseng,et al.  A Comparison of Compiler Tiling Algorithms , 1999, CC.

[3]  Mahmut T. Kandemir,et al.  Improving locality using loop and data transformations in an integrated framework , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[4]  Zhiyuan Li,et al.  New tiling techniques to improve cache temporal locality , 1999, PLDI '99.

[5]  Qing Yang,et al.  A novel cache design for vector processing , 1992, ISCA '92.

[6]  Michael E. Wolf,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[7]  Chau-Wen Tseng,et al.  Data transformations for eliminating conflict misses , 1998, PLDI.

[8]  Sharad Malik,et al.  Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.

[9]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[10]  Vivek Sarkar,et al.  Optimization of array accesses by collective loop transformations , 1991, ICS '91.

[11]  Dennis Gannon,et al.  Strategies for cache and local memory management by global program transformation , 1988, J. Parallel Distributed Comput..

[12]  Steve Carr,et al.  Compiler blockability of dense matrix factorizations , 1997, TOMS.

[13]  José González,et al.  The design and performance of a conflict-avoiding cache , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[14]  V. Sarkar,et al.  Collective Loop Fusion for Array Contraction , 1992, LCPC.

[15]  Sharad Malik,et al.  Cache miss equations: an analytical representation of cache misses , 1997, ICS '97.

[16]  François Bodin,et al.  Skewed associativity enhances performance predictability , 1995, Proceedings 22nd Annual International Symposium on Computer Architecture.

[17]  Ken Kennedy,et al.  Improving register allocation for subscripted variables , 1990, SIGP.

[18]  Ken Kennedy Fast greedy weighted fusion , 2000, ICS '00.

[19]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[20]  Mahmut T. Kandemir,et al.  A compiler algorithm for optimizing locality in loop nests , 1997, ICS '97.

[21]  Ken Kennedy,et al.  Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.

[22]  Ken Kennedy,et al.  Vector Register Allocation , 1992, IEEE Trans. Computers.

[23]  Olivier Temam,et al.  A quantitative analysis of loop nest locality , 1996, ASPLOS VII.

[24]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[25]  Josep Llosa,et al.  Optimizing program locality through CMEs and GAs , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.