Overcoming Early-Life Failure and Aging for Robust Systems

The prospect of system failure has increased because of device- and chip-level effects in the late CMOS era. In this article, the authors present novel system-level architecture and design innovations to cope with these lifetime reliability challenges. At nanometer-scale geometries, several hardware failure mechanisms that were largely benign in the past are becoming visible at the system level. Moreover, recent studies indicate that, depending on the application, hardware failures can be significant contributors to overall system failure rates.

Designing robust systems that ensure the required hardware reliability is nontrivial; it is achievable, but at high cost. Concurrent error detection during system operation is an extremely important aspect of such systems.

Hardware reliability challenges arise from three major sources: early-life failures (also called infant mortality), radiation-induced soft errors, and circuit aging. Several techniques, such as Built-in Soft-Error Resilience (BISER), can be used effectively to correct radiation-induced transient (soft) errors. This article focuses on early-life failures (ELF) and circuit aging. The techniques it presents exploit specific characteristics of these reliability mechanisms and avoid the high costs of traditional concurrent error detection.
