Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures

Conventional microarchitectures choose a single memory hierarchy design point targeted at the average application. In this paper, we propose a cache and TLB layout and design that leverages repeater insertion to provide dynamic low-cost configurability trading off size and speed on a per application phase basis. A novel configuration management algorithm dynamically detects phase changes and reacts to an application's hit and miss intolerance in order to improve memory hierarchy performance while taking energy consumption into consideration. When applied to a two-level cache and TLB hierarchy at 0.1 /spl mu/m technology, the result is an average 15% reduction in cycles per instruction (CPI), corresponding to an average 27% reduction in memory-CPI, across a broad class of applications compared to the best conventional two-level hierarchy of comparable size. Projecting to sub-.1 /spl mu/m technology design considerations that call for a three-level conventional cache hierarchy for performance reasons, we demonstrate that a configurable L2/L3 cache hierarchy coupled with a conventional LI results in an average 43% reduction in memory hierarchy energy in addition to improved performance.

[1]  Rajeev Balasubramonian,et al.  Dynamic Memory Hierarchy Performance Optimization , 2000 .

[2]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[3]  William J. Dally,et al.  Digital systems engineering , 1998 .

[4]  Richard E. Kessler,et al.  The Alpha 21264 microprocessor , 1999, IEEE Micro.

[5]  Michael J. Flynn,et al.  Limits of Scaling MOSFETs , 1995 .

[6]  Norman P. Jouppi,et al.  Complexity/performance tradeoffs with non-blocking loads , 1994, ISCA '94.

[7]  William J. Bowhill,et al.  Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU , 1995, Digit. Tech. J..

[8]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[9]  David H. Albonesi Dynamic IPC/clock rate optimization , 1998, ISCA.

[10]  Kenneth C. Yeager The Mips R10000 superscalar microprocessor , 1996, IEEE Micro.

[11]  N. S. Barnett,et al.  Private communication , 1969 .

[12]  Ashok Kumar,et al.  The HP PA-8000 RISC CPU , 1997, IEEE Micro.

[13]  Anne Rogers,et al.  Supporting dynamic data structures on distributed-memory machines , 1995, TOPL.

[14]  Norman P. Jouppi,et al.  Reconfigurable caches and their application to media processing , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[15]  Kanad Ghose,et al.  Analytical energy dissipation models for low-power caches , 1997, ISLPED '97.

[16]  David H. Albonesi,et al.  Selective cache ways: on-demand cache resource allocation , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[17]  Norman P. Jouppi,et al.  Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[18]  Thorsten von Eicken,et al.  技術解説 IEEE Computer , 1999 .