Performance Constrained Static Energy Reduction Using Way-Sharing Target-Banks

Most chip multiprocessors (CMPs) share a large last-level cache (LLC). In non-uniform cache access (NUCA) architectures, the LLC is divided into multiple banks that can be accessed independently. It has been observed that a major share of the chip power in a CMP is consumed by the LLC banks, and that this power has two major components: dynamic and static. Techniques have been proposed to reduce the static power consumption of the LLC by powering off the less utilized banks and forwarding their requests to other active banks (target banks). Once a bank is powered off, all future requests still arrive at its controller and are forwarded to its target bank. Such a bank shutdown process saves static power but reduces the performance of the LLC. When multiple banks are shut down, the target banks may also become overloaded. Additionally, request forwarding increases on-chip traffic. In this paper, we improve the performance of the target banks by dynamically managing their associativity. The cost of request forwarding is reduced by considering network distance as an additional metric for target selection. Together, these two strategies help to limit performance degradation. Experimental analysis shows a 43% reduction in static energy and a 23% reduction in energy-delay product (EDP) for a 4 MB LLC under a performance constraint of 3%.
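
The idea of using network distance as an additional metric for target selection can be illustrated with a minimal sketch. It is not the paper's algorithm: the 2D-mesh layout, the Manhattan hop count, the linear cost function, and all names (`Bank`, `hop_distance`, `select_target`, `distance_weight`) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bank:
    """Hypothetical LLC bank placed on a 2D mesh (layout is an assumption)."""
    bank_id: int
    x: int
    y: int
    utilization: float  # assumed load metric in [0, 1]
    powered_on: bool = True


def hop_distance(a: Bank, b: Bank) -> int:
    """Manhattan hop count between two banks on the assumed mesh."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def select_target(victim: Bank, banks: list[Bank], distance_weight: float = 0.1) -> Bank:
    """Pick a target bank for a bank that is about to be powered off.

    The cost is an assumed linear mix of the candidate's load and its
    distance from the victim; the candidate with the lowest cost wins.
    """
    candidates = [b for b in banks if b.powered_on and b.bank_id != victim.bank_id]
    if not candidates:
        raise ValueError("no active bank available as a target")
    return min(
        candidates,
        key=lambda b: b.utilization + distance_weight * hop_distance(victim, b),
    )


# Illustrative data only: bank 3 is lightly loaded but six hops away.
banks = [
    Bank(0, 0, 0, 0.10),
    Bank(1, 1, 0, 0.50),
    Bank(2, 0, 1, 0.45),
    Bank(3, 3, 3, 0.20),
]
victim = banks[0]
print(select_target(victim, banks).bank_id)
print(select_target(victim, banks, distance_weight=0.0).bank_id)
```

With `distance_weight=0.0` the sketch falls back to a purely utilization-based choice, which makes it easy to compare the two selection policies on the same data.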
