Use It or Lose It

Moore's Law scaling continues to yield higher transistor density with each succeeding process generation, leading to today's many-core chip multiprocessors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep submicron CMOS process technology is increasingly susceptible to wear. Prolonged operational stress accelerates wearout and failure through several physical failure mechanisms, including hot-carrier injection (HCI) and negative-bias temperature instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in a many-core CMP may not be catastrophic, a single fault in the interprocessor network-on-chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks or even partition away vital components such as the memory controller or other critical I/O. In this article, we study the HCI- and NBTI-induced wear that real workloads inflict on the interconnect microarchitecture, and we develop a critical path model for NBTI-induced wearout. A key finding of this modeling is that, counter to prevailing wisdom, wearout in the CMP's on-chip interconnect is correlated with a lack of load observed in the NoC routers rather than with high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wear-sensitive components exercised without significantly impacting the cycle time, pipeline depth, area, or power consumption of the overall router. We also propose a novel deterministic approach for generating appropriate exercise-mode data, ensuring that design parameter targets are met. We subsequently show that the proposed design yields a ∼2,300× decrease in the rate of wear.
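The link between low load and NBTI wear can be illustrated with a toy calculation. The sketch below is not the article's wear model or its deterministic exercise-data generator; it only assumes that a PMOS device is under NBTI stress whenever the signal driving its gate sits at logic 0, and that an idle router holds its signals at a fixed value. The function names, the 10-cycle trace, and the alternating exercise pattern are all hypothetical.

```python
from typing import List, Sequence


def stress_duty_cycle(trace: Sequence[int]) -> float:
    """Fraction of cycles in which the signal is at logic 0.

    Assumption: a 0 on the gate of a PMOS device places it under
    negative bias, so this fraction stands in for NBTI stress time.
    """
    if not trace:
        raise ValueError("trace must contain at least one cycle")
    return sum(1 for bit in trace if bit == 0) / len(trace)


def with_exercise(trace: Sequence[int], busy: Sequence[bool],
                  pattern: Sequence[int]) -> List[int]:
    """Return a copy of the trace in which idle cycles carry exercise data.

    Busy cycles keep their real traffic value; idle cycles take the next
    value from `pattern`, which repeats (hypothetical exercise-mode data).
    """
    exercised = []
    next_idx = 0
    for bit, is_busy in zip(trace, busy):
        if is_busy:
            exercised.append(bit)
        else:
            exercised.append(pattern[next_idx % len(pattern)])
            next_idx += 1
    return exercised


# Hypothetical 10-cycle window of one signal in a lightly loaded router:
# traffic arrives in only two cycles, and the signal rests at 0 otherwise.
busy = [True, False, False, False, False, True, False, False, False, False]
trace = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

idle_duty = stress_duty_cycle(trace)
exercised_trace = with_exercise(trace, busy, pattern=[1, 0])
exercised_duty = stress_duty_cycle(exercised_trace)

print(f"stress duty cycle without exercise: {idle_duty:.1f}")
print(f"stress duty cycle with exercise:    {exercised_duty:.1f}")
```

In this toy window the lightly loaded signal spends most of its time in the stressing state, and filling the idle cycles with alternating data halves that fraction. The real scheme must additionally choose exercise data so that cycle time, area, and power targets are met, which this sketch does not attempt.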
