Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors

The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper, we propose techniques that improve the bandwidth of a single cache port by adding buffering in the processor and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic workloads that include operating system activity. Using a single-ported cache, our techniques achieve 91% of the performance of a dual-ported cache.
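The abstract names two levers, extra buffering and fuller use of a wide port, only at a high level. The sketch below is a hypothetical counting model, not the authors' simulator or their exact mechanisms. It assumes a 32-byte cache line that the single port can read or write in one access. Loads examined together in a small program-order window (`window`) share one access when they fall in the same line, and stores go into a small coalescing store buffer (`capacity` line entries, FIFO eviction) that merges stores to the same line. All function names and parameters are illustrative. The model counts port accesses only and ignores timing, misses, and ordering hazards, so it does not reproduce the 91% figure.

```python
LINE_BYTES = 32  # hypothetical cache-line size, assumed equal to the port width


def line_of(addr: int, line_bytes: int = LINE_BYTES) -> int:
    """Index of the cache line containing byte address `addr`."""
    return addr // line_bytes


def naive_port_accesses(loads: list[int], stores: list[int]) -> int:
    """Baseline: one cache-port access per load and per store."""
    return len(loads) + len(stores)


def combined_load_accesses(loads: list[int], line_bytes: int = LINE_BYTES,
                           window: int = 4) -> int:
    """Loads are examined `window` at a time in program order; all loads in a
    group that fall in the same line are served by one wide-port access."""
    accesses = 0
    for i in range(0, len(loads), window):
        group = loads[i:i + window]
        accesses += len({line_of(a, line_bytes) for a in group})
    return accesses


def buffered_store_writes(stores: list[int], line_bytes: int = LINE_BYTES,
                          capacity: int = 2) -> int:
    """Hypothetical coalescing store buffer with `capacity` line entries.
    A store to a line already buffered merges into that entry; each entry
    costs one port access when it is eventually written to the cache."""
    buffered: list[int] = []
    writes = 0
    for addr in stores:
        line = line_of(addr, line_bytes)
        if line in buffered:
            continue  # merged into an existing entry: no extra port access
        if len(buffered) == capacity:
            buffered.pop(0)  # evict the oldest entry: one write to the cache
            writes += 1
        buffered.append(line)
    return writes + len(buffered)  # remaining entries drain later


def improved_port_accesses(loads: list[int], stores: list[int],
                           window: int = 4, capacity: int = 2) -> int:
    """Port accesses with load combining plus store buffering."""
    return (combined_load_accesses(loads, window=window)
            + buffered_store_writes(stores, capacity=capacity))


if __name__ == "__main__":
    loads = [0, 4, 8, 40, 44, 100, 104, 108]
    stores = [200, 204, 300, 208]
    print(naive_port_accesses(loads, stores))     # 12
    print(improved_port_accesses(loads, stores))  # 6
```

On this toy trace the baseline needs 12 port accesses and the combined scheme needs 6. Shrinking `capacity` to 1 forces the buffer to write line 6 back twice, raising the store cost from 2 accesses to 3.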
