LEMap: Controlling leakage in large chip-multiprocessor caches via profile-guided virtual address translation

The emerging trend of larger number of cores or processors on a single chip in the server, desktop, and mobile notebook platforms necessarily demands larger amount of on-chip last level cache. However, larger caches threaten to dramatically increase the leakage power as the industry moves into deeper sub-micron technology. In this paper, with the aim of reducing leakage energy we introduce LEMap (low energy map), a novel virtual address translation scheme to control the set of physical pages mapped to each bank of a large multi-banked non-uniform access L2 cache shared across all the cores. Combination of profiling, a simple off-line clustering algorithm, and a new flavor of Irix-style application-directed page placement system call maps the virtual pages that are accessed in the L2 cache roughly together onto the same region of the cache. Thus LEMap makes the access windows of the pages mapped to a region roughly identical and increases the average idle time of a region. As a result, powering down a region after the last access to the clusters of the corresponding virtual pages saves a much bigger amount of L2 cache energy compared to a usual virtual address translation scheme that is oblivious to access patterns. Our execution-driven simulation of an eight-core chip-multiprocessor with a 16 MB shared L2 cache using a 65 nm process on eight shared memory parallel applications drawn from SPLASH-2, SPEC OMP, and DIS suites shows that LEMap, on average, saves 7% of total energy, 50% of L2 cache energy, and 52% of L2 cache power while suffering from a 3% loss in performance compared to a baseline system that employs drowsy cells as well as region power-down without access clustering.

[1]  Wei Zhang,et al.  Reducing data cache leakage energy using a compiler-based approach , 2005, TECS.

[2]  Saibal Mukhopadhyay,et al.  Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits , 2003, Proc. IEEE.

[3]  Frank Vahid,et al.  A highly configurable cache for low energy embedded systems , 2005, TECS.

[4]  Li-Shiuan Peh,et al.  Leakage power modeling and optimization in interconnection networks , 2003, ISLPED '03.

[5]  Krste Asanovic,et al.  Dynamic fine-grain leakage reduction using leakage-biased bitlines , 2002, ISCA.

[6]  Nikil Dutt,et al.  An Enhanced Power Estimation Model for On-Chip Caches , 2004 .

[7]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[8]  Rohit Bhatia,et al.  Montecito: a dual-core, dual-thread Itanium processor , 2005, IEEE Micro.

[9]  Wei Zhang,et al.  Compiler-directed instruction cache leakage optimization , 2002, MICRO.

[10]  Mahmut T. Kandemir,et al.  Exploiting program hotspots and code sequentiality for instruction cache leakage management , 2003, ISLPED '03.

[11]  Larry L. Biro,et al.  Power considerations in the design of the Alpha 21264 microprocessor , 1998, Proceedings 1998 Design and Automation Conference. 35th DAC. (Cat. No.98CH36175).

[12]  Mahmut T. Kandemir,et al.  Compiler-directed array interleaving for reducing energy in multi-bank memories , 2002, Proceedings of ASP-DAC/VLSI Design 2002. 7th Asia and South Pacific Design Automation Conference and 15h International Conference on VLSI Design.

[13]  Lieven Eeckhout,et al.  Quantifying the Impact of Input Data Sets on Program Behavior and its Applications , 2003, J. Instr. Level Parallelism.

[14]  S. Tam,et al.  A Dual-Core Multi-Threaded Xeon Processor with 16MB L3 Cache , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.

[15]  Margaret Martonosi,et al.  Cache decay: exploiting generational behavior to reduce cache leakage power , 2001, ISCA 2001.

[16]  Kaushik Roy,et al.  Gated-Vdd: a circuit technique to reduce leakage in deep-submicron cache memories , 2000, ISLPED '00.

[17]  Johannes G. Janzen Calculating Memory System Power for DDR SDRAM , 2001 .

[18]  Carla Schlatter Ellis,et al.  Power aware page allocation , 2000, SIGP.

[19]  A. Devgan,et al.  Efficient techniques for gate leakage estimation , 2003, Proceedings of the 2003 International Symposium on Low Power Electronics and Design, 2003. ISLPED '03..

[20]  Mahmut T. Kandemir,et al.  DRAM energy management using software and hardware directed power mode control , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[21]  Mahmut T. Kandemir,et al.  Managing Leakage Energy in Cache Hierarchies , 2003, J. Instr. Level Parallelism.

[22]  Kaushik Roy,et al.  An integrated circuit/architecture approach to reducing leakage in deep-submicron high-performance I-caches , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[23]  Sangyeun Cho,et al.  Managing Distributed, Shared L2 Caches through OS-Level Page Allocation , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[24]  T. Mudge,et al.  Drowsy caches: simple techniques for reducing leakage power , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[25]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[26]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[27]  Mahmut T. Kandemir,et al.  Partitioned instruction cache architecture for energy efficiency , 2003, TECS.