Dynamic Guardband Selection: Thermal-Aware Optimization for Unreliable Multi-Core Systems

Circuit aging has become a major reliability concern in current and upcoming technology nodes. For instance, Bias Temperature Instability (BTI) increases the threshold voltage of a transistor. That, in turn, may prolong the critical path delay of the processor and may eventually lead to timing errors. To avoid aging-induced timing errors, designers employ guardbands with respect to either voltage or frequency. State-of-the-art techniques determine the guardband type at the circuit level at design time, irrespective of the workload running at the system level. Our investigation revealed that the temperatures generated by a running workload can play a key role in determining which guardband type is appropriate with respect to system performance. Therefore, we propose a paradigm shift in designing guardbands: selecting the guardband type on the fly with respect to the workload-induced temperatures, with the aim of optimizing performance under temperature and reliability constraints. Moreover, different guardband types can be selected for different cores simultaneously when multiple applications with diverse properties make this useful. Our dynamic guardband selection achieves higher performance than techniques that employ a single guardband type, fixed at design time, throughout.
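
To make the trade-off between the two guardband types concrete, the following is a minimal sketch and not the selection technique proposed in this work. It assumes the textbook alpha-power-law delay model, a dynamic power that scales with the square of the supply voltage, and a single lumped thermal resistance per core. All names (gate_delay, frequency_guardband, voltage_guardband, select_guardband) and all constants are hypothetical and chosen only for illustration.

```python
ALPHA = 1.3  # assumed velocity-saturation exponent of the alpha-power law


def gate_delay(vdd, vth, alpha=ALPHA):
    """Relative critical-path delay: d ~ Vdd / (Vdd - Vth)^alpha (assumed model)."""
    return vdd / (vdd - vth) ** alpha


def frequency_guardband(vdd, vth, dvth):
    """Fraction by which the clock must slow down to tolerate a Vth shift dvth."""
    fresh = gate_delay(vdd, vth)
    aged = gate_delay(vdd, vth + dvth)
    return 1.0 - fresh / aged


def voltage_guardband(vdd, vth, dvth, tol=1e-6):
    """Extra supply voltage that restores the fresh delay despite the Vth shift."""
    target = gate_delay(vdd, vth)
    lo, hi = vdd, vdd + 1.0  # assumed search window of 1 V above nominal
    # For alpha >= 1 the delay decreases with Vdd, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gate_delay(mid, vth + dvth) > target:
            lo = mid
        else:
            hi = mid
    return hi - vdd


def select_guardband(power_w, vdd, vth, dvth,
                     t_amb=45.0, r_th=0.8, t_crit=80.0):
    """Hypothetical per-core policy: keep full frequency with a voltage
    guardband if the resulting temperature stays below t_crit, otherwise
    fall back to a frequency guardband."""
    dv = voltage_guardband(vdd, vth, dvth)
    # Dynamic power ~ V^2 * f; frequency is unchanged under a voltage guardband.
    scaled_power = power_w * ((vdd + dv) / vdd) ** 2
    temperature = t_amb + r_th * scaled_power  # lumped steady-state estimate
    return "voltage" if temperature <= t_crit else "frequency"


if __name__ == "__main__":
    # Two cores with different workload-induced power (illustrative values).
    for name, power in [("cool core", 20.0), ("hot core", 40.0)]:
        print(name, select_guardband(power, vdd=1.0, vth=0.3, dvth=0.05))
```

Under these assumed numbers the cooler core can absorb the extra power of a voltage guardband and keep its frequency, while the hotter core would exceed the thermal limit and is assigned a frequency guardband instead. This mirrors the idea that different cores may receive different guardband types at the same time.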
