WearMon: Reliability monitoring using adaptive critical path testing

As processor reliability becomes a first order design constraint, this research argues for a need to provide continuous reliability monitoring. We present WearMon, an adaptive critical path monitoring architecture which provides accurate and real-time measure of the processor's timing margin degradation. Special test patterns check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test will capture up-to-date timing margin of these paths. The monitoring architecture dynamically adapts testing interval and complexity based on analysis of prior test results, which increases efficiency and accuracy of monitoring. Experimental results based on FPGA implementation show that the proposed monitoring unit can be easily integrated into existing designs. Monitoring overhead can be reduced to zero by scheduling tests only when a unit is idle.

[1]  Shuguang Feng,et al.  Self-calibrating Online Wearout Detection , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2]  Josep Torrellas,et al.  Facelift: Hiding and slowing down aging in multicores , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[3]  Todd M. Austin,et al.  SimpleScalar: An Infrastructure for Computer System Modeling , 2002, Computer.

[4]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[5]  Muhammad Ashraful Alam,et al.  A comprehensive model of PMOS NBTI degradation , 2005, Microelectron. Reliab..

[6]  Todd M. Austin,et al.  Ultra low-cost defect protection for microprocessor pipelines , 2006, ASPLOS XII.

[7]  Pradip Bose,et al.  The case for lifetime reliability-aware microprocessors , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[8]  Edward T. Grochowski,et al.  Implications of Device Timing Variability on Full Chip Timing , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[9]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[10]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[11]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[12]  Margaret Martonosi,et al.  Voltage and frequency control with adaptive reaction time in multiple-clock-domain processors , 2005, 11th International Symposium on High-Performance Computer Architecture.

[13]  Gu-Yeon Wei,et al.  ReVIVaL: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency , 2008, 2008 International Symposium on Computer Architecture.

[14]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[15]  Doris Schmitt-Landsiedel,et al.  A 65 nm test structure for SRAM device variability and NBTI statistics , 2009, ESSDERC 2009.

[16]  S. Rauch,et al.  Review and Reexamination of Reliability Effects Related to NBTI-Induced Statistical Variations , 2007, IEEE Transactions on Device and Materials Reliability.

[17]  Sule Ozev,et al.  A mechanism for online diagnosis of hard faults in microprocessors , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[18]  Sarita V. Adve,et al.  Accurate microarchitecture-level fault modeling for studying hardware faults , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[19]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[20]  Pradip Bose,et al.  A Framework for Architecture-Level Lifetime Reliability Modeling , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[21]  Babak Falsafi,et al.  Detecting Emerging Wearout Faults , 2007 .

[22]  Yun Zhang,et al.  Revisiting the Sequential Programming Model for Multi-Core , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[23]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[24]  Ming Zhang,et al.  Circuit Failure Prediction and Its Application to Transistor Aging , 2007, 25th IEEE VLSI Test Symposium (VTS'07).