High Performance Cache Architectures to Support Dynamic Superscalar Microprocessors

Simple cache structures cannot supply the memory bandwidth that a dynamic superscalar processor needs, so more sophisticated memory hierarchies, such as non-blocking and pipelined caches, are required. To guide designers of modern high-performance microprocessors, we investigate the performance tradeoffs among cache size, blocking versus non-blocking caches, and cache pipeline depth within the memory subsystem of a dynamic superscalar processor running integer applications. The results show that the dynamic superscalar processor can hide about two-thirds of the additional latency of two- and three-stage pipelined caches, and that a non-blocking cache is always beneficial. A pipelined cache outperforms a non-pipelined cache only if the miss penalty and miss rates are large.
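The last two findings can be illustrated with a back-of-the-envelope cost model. The sketch below is not the simulation methodology of the paper. It assumes a simple additive model in which each extra cache pipeline stage adds one cycle of load latency, of which about two-thirds is hidden by dynamic scheduling (the figure reported above), and it assumes that the pipelined design is a larger cache with a lower miss rate. The function names, the miss rates, and the miss penalties are all hypothetical values chosen for illustration.

```python
def exposed_cycles_per_access(miss_rate, miss_penalty,
                              pipeline_depth=1, hidden_fraction=2 / 3):
    """Rough per-access cost in cycles beyond a one-cycle hit.

    Assumed model (not from the paper): each pipeline stage beyond the
    first adds one cycle of latency, and the processor hides
    `hidden_fraction` of that added latency.
    """
    extra_latency = pipeline_depth - 1
    exposed_latency = extra_latency * (1.0 - hidden_fraction)
    return exposed_latency + miss_rate * miss_penalty


def pipelined_wins(base_miss_rate, pipelined_miss_rate, miss_penalty,
                   pipeline_depth=2):
    """True if the pipelined cache has the lower per-access cost."""
    base = exposed_cycles_per_access(base_miss_rate, miss_penalty)
    piped = exposed_cycles_per_access(pipelined_miss_rate, miss_penalty,
                                      pipeline_depth=pipeline_depth)
    return piped < base


# Hypothetical low-miss case: the added hit latency dominates.
low_miss_case = pipelined_wins(0.02, 0.01, miss_penalty=6)

# Hypothetical high-miss case: the saved miss cycles dominate.
high_miss_case = pipelined_wins(0.10, 0.06, miss_penalty=40)
```

Under these assumptions the pipelined cache comes out ahead only when the reduction in miss rate multiplied by the miss penalty exceeds the exposed part of the added latency, which is consistent with the conclusion that pipelining pays off only for large miss rates and miss penalties.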
