Cache decay: exploiting generational behavior to reduce cache leakage power

Power dissipation is increasingly important in CPUs, from those intended for mobile use all the way up to high-performance processors for high-end servers. While the bulk of the power dissipated is dynamic switching power, leakage power is also beginning to be a concern. Chipmakers expect that in future chip generations, leakage's proportion of total chip power will increase significantly. This paper examines methods for reducing leakage power within the cache memories of the CPU. Because caches account for much of a CPU chip's area and transistor count, they are reasonable targets for attacking leakage. We discuss policies and implementations for reducing cache leakage by invalidating and “turning off” cache lines when they hold data not likely to be reused. In particular, our approach targets the generational nature of cache line usage. That is, cache lines typically have a flurry of frequent use when first brought into the cache, and then have a period of “dead time” before they are evicted. By devising effective, low-power ways of deducing dead time, our results show that in many cases we can reduce L1 cache leakage energy by 4x in SPEC2000 applications without impacting performance. Because our decay-based techniques are rooted in competitive on-line algorithms, their energy usage can be theoretically bounded to within a factor of two of the optimal oracle-based policy. We also examine adaptive decay-based policies that make energy-minimizing policy choices on a per-application basis by choosing appropriate decay intervals individually for each cache line. Our proposed adaptive policies effectively reduce L1 cache leakage energy by 5x for the SPEC2000 applications with only negligible degradations in performance.
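The decay idea described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's implementation: it assumes a direct-mapped cache, a fixed decay interval measured in calls to a hypothetical `tick()` method, and an exact per-line idle counter. The names `DecayCache`, `Line`, `decay_interval`, and `powered_line_cycles` are illustrative and do not come from the paper. A line that goes unreferenced for `decay_interval` ticks is invalidated and powered off, and the count of powered line-cycles serves as a rough proxy for leakage energy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    tag: Optional[int] = None   # block address held by this line, if any
    idle: int = 0               # ticks since the last access (assumed exact counter)
    powered: bool = False       # False means the line is "turned off"


class DecayCache:
    """Hypothetical direct-mapped cache whose lines decay after an idle interval."""

    def __init__(self, num_lines: int, decay_interval: int) -> None:
        self.lines = [Line() for _ in range(num_lines)]
        self.decay_interval = decay_interval
        self.hits = 0
        self.misses = 0
        self.powered_line_cycles = 0  # rough proxy for leakage energy

    def tick(self) -> None:
        """Advance time by one step; power off lines idle for too long."""
        for line in self.lines:
            if not line.powered:
                continue
            self.powered_line_cycles += 1
            line.idle += 1
            if line.idle >= self.decay_interval:
                # Presumed dead: invalidate the line and cut its power.
                line.powered = False
                line.tag = None

    def access(self, block_addr: int) -> bool:
        """Reference a block; return True on a hit, False on a miss."""
        line = self.lines[block_addr % len(self.lines)]
        hit = line.powered and line.tag == block_addr
        if hit:
            self.hits += 1
        else:
            self.misses += 1
            line.tag = block_addr
            line.powered = True
        line.idle = 0  # any access restarts the decay countdown
        return hit


if __name__ == "__main__":
    cache = DecayCache(num_lines=4, decay_interval=3)
    cache.access(0)          # cold miss: line 0 is powered on
    cache.tick()
    cache.access(0)          # hit: still within the decay interval
    for _ in range(3):
        cache.tick()         # line 0 idles long enough to decay
    cache.access(0)          # decay-induced miss
    print(cache.hits, cache.misses, cache.powered_line_cycles)
```

In this sketch a short decay interval lowers `powered_line_cycles` but risks extra misses when a decayed line is referenced again, while a long interval does the opposite; that is the trade-off an adaptive policy would tune.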
