Reducing L1 caches power by exploiting software semantics

To access a set-associative L1 cache in a high-performance processor, all ways of the selected set are searched and fetched in parallel using physical address bits. Such a cache is oblivious to the software semantics of memory references, such as the stack-heap bifurcation of the memory space and user-kernel ring levels. This wastes energy because, for example, a user-mode instruction fetch will never hit a cache block that contains kernel code. Similarly, a stack access will not hit a cache line that contains heap data. We propose to exploit software semantics in cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we use virtual memory region properties to optimize the data cache and ring-level information to optimize the instruction cache. Our design does not impact performance and incurs very small hardware cost. Simulation results using SPEC CPU and SPECjapps indicate that the proposed designs reduce cache block fetches from the DL1 and IL1 by 27% and 57%, respectively, resulting in average savings of 15% of DL1 power and more than 30% of IL1 power compared to an aggressively clock-gated baseline.
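The filtering idea can be illustrated with a small behavioral model. The sketch below is not the paper's hardware design; it is a minimal Python illustration that assumes each cache line carries one hypothetical semantic bit (user vs. kernel for an instruction cache, or heap vs. stack for a data cache), and that a given block is only ever accessed with one value of that bit. The class name, field names, LRU replacement, and counters are all assumptions made for illustration. It counts how many ways would be fetched with and without the filter.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical semantic labels (illustrative, not from the paper).
USER, KERNEL = 0, 1


@dataclass
class Line:
    tag: Optional[int] = None  # None marks an invalid line
    label: int = USER          # assumed one-bit semantic label stored per line
    last_used: int = 0         # timestamp for LRU replacement


class SemanticCache:
    """Toy set-associative cache that skips ways whose label cannot match."""

    def __init__(self, num_sets: int = 64, num_ways: int = 8, block_bytes: int = 64):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.block_bytes = block_bytes
        self.sets = [[Line() for _ in range(num_ways)] for _ in range(num_sets)]
        self.clock = 0
        self.hits = 0
        self.baseline_fetches = 0  # ways read by a conventional parallel lookup
        self.filtered_fetches = 0  # ways read when the label filter is applied

    def access(self, addr: int, label: int) -> bool:
        """Look up addr; return True on a hit, False on a miss (then fill)."""
        self.clock += 1
        block = addr // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        ways = self.sets[index]

        # A conventional cache reads every way of the selected set.
        self.baseline_fetches += self.num_ways

        # The filtered cache reads only valid ways with a matching label.
        candidates = [w for w in ways if w.tag is not None and w.label == label]
        self.filtered_fetches += len(candidates)

        for way in candidates:
            if way.tag == tag:
                way.last_used = self.clock
                self.hits += 1
                return True

        # Miss: fill an invalid way if one exists, else the least recently used.
        victim = min(ways, key=lambda w: (w.tag is not None, w.last_used))
        victim.tag, victim.label, victim.last_used = tag, label, self.clock
        return False
```

Under the stated assumption that a block never changes its label, the filter cannot turn a hit into a miss, so the hit count matches a conventional lookup while `filtered_fetches` stays at or below `baseline_fetches`. The ratio of the two counters is only a rough proxy for the fetch reductions reported above; it does not model power.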
