Cross-Layer Approaches for Monitoring, Margining and Mitigation of Circuit Variability

Author(s): Lai, Liangzhen | Advisor(s): Gupta, Puneet | Abstract: With technology scaling, circuit performance has become more sensitive to various sources of variability, including manufacturing variations, ambient fluctuations, and circuit wear-out. These increased variations have created new challenges for conventional hardware guardbanding, as the additional design margin diminishes the benefits of technology scaling. This dissertation aims to reduce total system design margin with cross-layer approaches to monitoring, margining and mitigation of circuit variability.

Since hardware and software adaptation can be used to reduce design margin with the exposed hardware variability provided by hardware monitors, we start by proposing two different types of performance monitors that can achieve better monitoring accuracy and smaller monitoring overhead. We also demonstrate the use of these performance monitors in system adaptation with our end-to-end implementation of software testbeds.

We also study the dynamic variations and reliability margining problem in the presence of monitor-and-actuate adaptation and emerging system contexts. In a system with monitor-and-actuate adaptation, dynamic variations require extra margin for monitor and actuate latencies. We analyze and study the margining problem considering different choices of the monitor and actuator types. System reliability margining strategies are also proposed for circuits in the “dark silicon” era, where the low-level design margin should consider the context of high-level power/thermal constraints.

Last, we propose a clock gating methodology to mitigate aging-induced clock skew, which is difficult to monitor and resolve through adaptation. For certain phenomena and variation sources, for example, soft error rates at different locations/altitudes, we also propose system/cloud-based monitors. An emulation platform is built to study the impacts of dynamic power management schemes on system reliability.
