Block Disabling Characterization and Improvements in CMPs Operating at Ultra-low Voltages

Power density has become the limiting factor in technology scaling as power budget restricts the amount of hardware that can be active at the same time. Reducing supply voltage to ultra-low voltage ranges close to the threshold region has the promise of great energy savings. However, the potential savings of voltage scaling are limited by the correct operation of SRAM cells, which is not guaranteed below V(ddmin), the minimum voltage in which cache structures operate reliably. Understanding the effects of operating below V(ddmin) requires complex modelling, so we introduce an updated probability failure model of SRAM cells at 22nm and explore the reliability impact of lowering the chip voltage supply below V(ddmin) in shared memory coherent chip-multiprocessors (CMP) running a variety of parallel workloads. A micro architectural technique to cope with cache reliability at ultra-low voltages is block disabling, however, in many cases, the savings in on-chip caches do not compensate for the consumption in the rest of the system, as the consumption increase of the off-chip memory may offset the on-chip gain. We make the case that existing coherence mechanisms can provide the substrate to improve energy savings with block disabling and propose two low-complexity techniques. Taking the best of both techniques we can scale voltage below V(ddmin) and reduce system energy up to 39%, and system energy-delay up to 10%. Besides, by lowering the CMP consumption in a power constrained scenario, we could activate offline cores, reaching a potential speedup between 3.7 and 4.4.

[1]  David Blaauw,et al.  Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits , 2010, Proceedings of the IEEE.

[2]  Yiannakis Sazeides,et al.  Performance-effective operation below Vcc-min , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[3]  Rajesh Kumar,et al.  A family of 45nm IA processors , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[4]  Niraj K. Jha,et al.  GARNET: A detailed on-chip network model inside a full-system simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[5]  Milo M. K. Martin,et al.  Why on-chip cache coherence is here to stay , 2012, Commun. ACM.

[6]  S. E. Schuster Multiple word/bit line redundancy for semiconductor memories , 1978 .

[7]  Jaydeep P. Kulkarni,et al.  Improving multi-core performance using mixed-cell cache architecture , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[8]  Michael Taylor A landscape of the new dark silicon design regime , 2013 .

[9]  Kevin M. Lepak,et al.  Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor , 2010, IEEE Micro.

[10]  Bruce Jacob,et al.  DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[11]  Nikil D. Dutt,et al.  REMEDIATE: A scalable fault-tolerant architecture for low-power NUCA cache in tiled CMPs , 2013, 2013 International Green Computing Conference Proceedings.

[12]  Amin Ansari,et al.  Archipelago: A polymorphic cache design for enabling robust near-threshold operation , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[13]  Rumi Zahir,et al.  The Medfield Smartphone: Intel Architecture in a Handheld Form Factor , 2013, IEEE Micro.

[14]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[15]  Josep Torrellas,et al.  Coping with Parametric Variation at Near-Threshold Voltages , 2013, IEEE Micro.

[16]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[17]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[18]  Sani R. Nassif,et al.  A resilience roadmap: (invited paper) , 2010, DATE 2010.

[19]  George Varghese,et al.  A 22nm IA multi-CPU and GPU System-on-Chip , 2012, 2012 IEEE International Solid-State Circuits Conference.

[20]  Wei Chen,et al.  The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series , 2007, IEEE Journal of Solid-State Circuits.

[21]  Nam Sung Kim,et al.  Minimizing total area of low-voltage SRAM arrays through joint optimization of cell size, redundancy, and ECC , 2010, 2010 IEEE International Conference on Computer Design.

[22]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[23]  Wen-Hann Wang,et al.  On the inclusion properties for multi-level cache hierarchies , 1988, ISCA '88.

[24]  K. Roy,et al.  A 160 mV Robust Schmitt Trigger Based Subthreshold SRAM , 2007, IEEE Journal of Solid-State Circuits.

[25]  Hyunjin Lee,et al.  Performance of Graceful Degradation for Cache Faults , 2007, IEEE Computer Society Annual Symposium on VLSI (ISVLSI '07).

[26]  Alaa R. Alameldeen,et al.  Trading off Cache Capacity for Reliability to Enable Low Voltage Operation , 2008, 2008 International Symposium on Computer Architecture.

[27]  Wei Wu,et al.  Improving cache lifetime reliability at ultra-low voltages , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).