Randomized Cache Placement for Eliminating Conflicts

Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
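To make the idea of a pseudorandom index function concrete, the following minimal sketch contrasts conventional bit-selection indexing with a simple XOR fold of two address fields. The XOR fold is an illustrative stand-in, not necessarily the index function used in the paper, and the line size, set count, and all function names are assumptions chosen for the example.

```python
# Hypothetical cache geometry, assumed for illustration only.
LINE_BITS = 5                  # 32-byte cache lines
INDEX_BITS = 8                 # 256 sets
NUM_SETS = 1 << INDEX_BITS


def conventional_index(addr):
    """Bit selection: the set index is the low-order bits of the block address."""
    return (addr >> LINE_BITS) & (NUM_SETS - 1)


def xor_index(addr):
    """Illustrative pseudorandom index: XOR the low index field with the next
    higher field of the block address (a simple stand-in hash)."""
    block = addr >> LINE_BITS
    low = block & (NUM_SETS - 1)
    high = (block >> INDEX_BITS) & (NUM_SETS - 1)
    return low ^ high


def distinct_sets(index_fn, stride, count):
    """Number of different sets touched by `count` accesses at a fixed byte stride."""
    return len({index_fn(i * stride) for i in range(count)})


# A stride of NUM_SETS lines is a worst case for bit selection:
# every access maps to the same set.
STRIDE = NUM_SETS << LINE_BITS
conv = distinct_sets(conventional_index, STRIDE, 64)
xor = distinct_sets(xor_index, STRIDE, 64)
print(conv, xor)
```

With this stride, bit selection sends all 64 accesses to a single set, while the XOR fold spreads them over 64 different sets. This is the kind of repetitive conflict pattern that a pseudorandom index function is meant to break up.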
