EECache: A Comprehensive Study on the Architectural Design for Energy-Efficient Last-Level Caches in Chip Multiprocessors

Power management for large last-level caches (LLCs) is important in chip multiprocessors (CMPs), because LLC leakage power accounts for a significant fraction of the limited on-chip power budget. Since not all workloads running on CMPs need the entire cache, portions of a large, shared LLC can be disabled to save energy. In this article, we explore design choices ranging from circuit-level cache organization to microarchitectural management policies, and propose a low-overhead runtime mechanism for reducing energy in the large, shared LLC. We first introduce a slice-based cache organization that can shut down parts of the shared LLC with minimal circuit overhead. Building on this organization, parts of the shared LLC are turned off according to the spatial and temporal cache access behavior captured by low-overhead sampling-based hardware. To eliminate the performance penalty caused by flushing data before powering off a cache slice, we propose data migration policies that prevent the loss of useful data in the LLC. Results show that our energy-efficient cache design (EECache) achieves 14.1% energy savings with only 1.2% performance degradation, and incurs negligible hardware overhead compared to prior work.
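The mechanism summarized above (sampled access behavior decides how many slices stay on, and data is migrated rather than flushed before a slice is powered off) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the design evaluated in the article: the names `CacheSlice`, `choose_active_slices`, and `power_off_with_migration`, the `max_loss` threshold, and the first-fit migration rule are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CacheSlice:
    """Toy model of one LLC slice (hypothetical, for illustration only)."""
    capacity: int                                 # number of blocks the slice can hold
    powered: bool = True
    blocks: dict = field(default_factory=dict)    # address -> data


def choose_active_slices(est_hits, max_loss=0.02):
    """Return the smallest slice count whose estimated hits stay close to the full cache.

    est_hits[k] is an assumed sampler estimate of hits with k+1 slices powered on.
    max_loss is an assumed tolerance on lost hits, not a value from the article.
    """
    full = est_hits[-1]
    for k, hits in enumerate(est_hits):
        if hits >= (1.0 - max_loss) * full:
            return k + 1
    return len(est_hits)


def power_off_with_migration(slices, victim_idx):
    """Move blocks out of the victim slice before gating it (first-fit, assumed policy).

    Returns (migrated, dropped); dropped blocks are those that found no free space.
    """
    victim = slices[victim_idx]
    migrated = 0
    for addr, data in list(victim.blocks.items()):
        for s in slices:
            if s is not victim and s.powered and len(s.blocks) < s.capacity:
                s.blocks[addr] = data
                migrated += 1
                break
    dropped = len(victim.blocks) - migrated
    victim.blocks.clear()
    victim.powered = False
    return migrated, dropped


# Usage: the sampler suggests 3 of 4 slices are enough; slice 1 is gated after migration.
needed = choose_active_slices([700, 950, 990, 1000])
slices = [CacheSlice(2, blocks={1: "a"}), CacheSlice(2, blocks={2: "b", 3: "c"})]
moved, lost = power_off_with_migration(slices, victim_idx=1)
```

In this toy model, a block that finds no free space in a powered slice is simply dropped, which is the case that migration policies aim to make rare for useful data.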
