Use it or lose it: Wear-out and lifetime in future chip multiprocessors

Moore's Law scaling is continuing to yield even higher transistor density with each succeeding process generation, leading to today's multi-core Chip Multi-Processors (CMPs) with tens or even hundreds of interconnected cores or tiles. Unfortunately, deep sub-micron CMOS process technology is marred by increasing susceptibility to wearout. Prolonged operational stress gives rise to accelerated wearout and failure, due to several physical failure mechanisms, including Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Each failure mechanism correlates with different usage-based stresses, all of which can eventually generate permanent faults. While the wearout of an individual core in many-core CMPs may not necessarily be catastrophic for the system, a single fault in the inter-processor Network-on-Chip (NoC) fabric could render the entire chip useless, as it could lead to protocol-level deadlocks, or even partition away vital components such as the memory controller or other critical I/O. In this paper, we develop critical path models for HCI- and NBTI-induced wear due to the actual stresses caused by real workloads, applied onto the interconnect micro-architecture. A key finding from this modeling being that, counter to prevailing wisdom, wearout in the CMP on-chip interconnect is correlated with lack of load observed in the NoC routers, rather than high load. We then develop a novel wearout-decelerating scheme in which routers under low load have their wearout-sensitive components exercised, without significantly impacting cycle time, pipeline depth, area or power consumption of the overall router. We subsequently show that the proposed design yields a 13.8×-65× increase in CMP lifetime.

[1]  Tao Li,et al.  Architecting reliable multi-core network-on-chip for small scale processing technology , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[2]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[3]  Sanghamitra Roy,et al.  Towards graceful aging degradation in NoCs through an adaptive routing algorithm , 2012, DAC Design Automation Conference 2012.

[4]  Sani R. Nassif,et al.  High Performance CMOS Variability in the 65nm Regime and Beyond , 2007 .

[5]  Sachin S. Sapatnekar,et al.  Impact of NBTI on SRAM read stability and design for reliability , 2006, 7th International Symposium on Quality Electronic Design (ISQED'06).

[6]  Chris Auth,et al.  22-nm fully-depleted tri-gate CMOS transistors , 2012, Proceedings of the IEEE 2012 Custom Integrated Circuits Conference.

[7]  Shuguang Feng,et al.  Self-calibrating Online Wearout Detection , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[8]  K.J. Kuhn,et al.  Reducing Variation in Advanced Logic Technologies: Approaches to Process and Design for Manufacturability of Nanoscale CMOS , 2007, 2007 IEEE International Electron Devices Meeting.

[9]  T. Numata,et al.  Performance, variability and reliability of silicon tri-gate nanowire MOSFETs , 2012, 2012 IEEE International Reliability Physics Symposium (IRPS).

[10]  Weidong Liu,et al.  An accurate and scalable MOSFET aging model for circuit simulation , 2011, 2011 12th International Symposium on Quality Electronic Design.

[11]  Pradip Bose,et al.  The case for lifetime reliability-aware microprocessors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[12]  Ming Zhang,et al.  Circuit Failure Prediction and Its Application to Transistor Aging , 2007, 25th IEEE VLSI Test Symposium (VTS'07).

[13]  Alain Greiner,et al.  A reconfigurable routing algorithm for a fault-tolerant 2D-Mesh Network-on-Chip , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[14]  Sanghamitra Roy,et al.  An MILP-based aging-aware routing algorithm for NoCs , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[15]  Qiang Xu,et al.  AgeSim: A simulation framework for evaluating the lifetime reliability of processor-based SoCs , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[16]  A. R. Newton,et al.  Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas , 1990 .

[17]  Fan Yang,et al.  Statistical reliability analysis under process variation and aging effects , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[18]  Pradip Bose,et al.  A Framework for Architecture-Level Lifetime Reliability Modeling , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[19]  Mehdi Baradaran Tahoori,et al.  ExtraTime: Modeling and analysis of wearout due to transistor aging at microarchitecture-level , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[20]  Li-Shiuan Peh,et al.  DRAIN: Distributed Recovery Architecture for Inaccessible Nodes in multi-core chips , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[21]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  Jaume Abella,et al.  Penelope: The NBTI-Aware Processor , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[23]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[24]  Saurabh Dighe,et al.  A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling , 2011, IEEE Journal of Solid-State Circuits.

[25]  Sorin Cotofana,et al.  A unified aging model of NBTI and HCI degradation towards lifetime reliability management for nanoscale MOSFET circuits , 2011, 2011 IEEE/ACM International Symposium on Nanoscale Architectures.

[26]  William J. Dally,et al.  A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[27]  Wolfgang Rosenstiel,et al.  Fully Adaptive Fault-Tolerant Routing Algorithm for Network-on-Chip Architectures , 2007, 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007).

[28]  Babak Falsafi,et al.  Detecting Emerging Wearout Faults , 2007 .

[29]  Josep Torrellas,et al.  The BubbleWrap many-core: Popping cores for sequential acceleration , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[30]  Georges G. E. Gielen,et al.  A methodology for measuring transistor ageing effects towards accurate reliability simulation , 2009, 2009 15th IEEE International On-Line Testing Symposium.

[31]  Subhasish Mitra,et al.  CASP: Concurrent Autonomous Chip Self-Test Using Stored Test Patterns , 2008, 2008 Design, Automation and Test in Europe.

[32]  David Blaauw,et al.  A highly resilient routing algorithm for fault-tolerant NoCs , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[33]  Shantanu Gupta,et al.  Architectural core salvaging in a multi-core processor for hard-error tolerance , 2009, ISCA '09.

[34]  Erika Gunadi,et al.  Combating Aging with the Colt Duty Cycle Equalizer , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[35]  Xiaojun Li,et al.  Compact Modeling of MOSFET Wearout Mechanisms for Circuit-Reliability Simulation , 2008, IEEE Transactions on Device and Materials Reliability.