Design and Evaluation of Confidence-Driven Error-Resilient Systems

Deeply scaled CMOS circuits are increasingly susceptible to transient faults and soft errors; emerging post-CMOS devices can be more vulnerable, sometimes exhibiting erratic errors of arbitrary duration. Applying timing and supply voltage margin is wasteful and becoming ineffective, and conventional checking and sparing techniques provide only a limited error coverage against widely varying errors. We propose a confidence-driven computing (CDC) model for an adaptive protection against nondeterministic errors. The CDC model employs fine-grained temporal redundancy and confidence checking for a faster adaptation and tunable reliability. The CDC model can be extended to deeply scaled CMOS circuits that are mainly affected by transient faults and soft errors, where an early checking (EC) technique can be used to perform independent error checking for more flexibility and better performance. To evaluate the CDC model, we apply a sample-based field-programmable gate array emulation along with real-time error injection. The CDC model is shown to adapt to fluctuating error rates and enhance the system reliability by effectively trading off performance. To evaluate the EC technique at a finer time scale, we create a new event-based simulation to capture path delay distribution, error model, and their interactions. The EC technique improves the system reliability by more than four orders of magnitude when errors are of short duration. Both the CDC model and the EC technique are synthesized in a 45-nm CMOS technology for cost estimates: 1) the area overhead is as low as 12% and 2) energy overhead can be limited to 19%.

[1]  Todd M. Austin,et al.  CrashTest: A fast high-fidelity FPGA-based resiliency analysis framework , 2008, 2008 IEEE International Conference on Computer Design.

[2]  Melvin A. Breuer,et al.  Defect and error tolerance in the presence of massive numbers of defects , 2004, IEEE Design & Test of Computers.

[3]  Helia Naeimi,et al.  Fault-tolerant sub-lithographic design with rollback recovery. , 2008, Nanotechnology.

[4]  Naresh R. Shanbhag,et al.  Energy-efficient signal processing via algorithmic noise-tolerance , 1999, Proceedings. 1999 International Symposium on Low Power Electronics and Design (Cat. No.99TH8477).

[5]  C. Lopez-Ongil,et al.  Autonomous Fault Emulation: A New FPGA-Based Acceleration System for Hardness Evaluation , 2007, IEEE Transactions on Nuclear Science.

[6]  David Blaauw,et al.  A confidence-driven model for error-resilient computing , 2011, 2011 Design, Automation & Test in Europe.

[7]  David M. Bull,et al.  RazorII: In Situ Error Detection and Correction for PVT and SER Tolerance , 2009, IEEE Journal of Solid-State Circuits.

[8]  Patrick Chiang,et al.  Error detection and recovery techniques for variation-aware CMOS computing: A comprehensive review , 2011 .

[9]  Chen Chang,et al.  BEE3: Revitalizing Computer Architecture Research , 2009 .

[10]  Subhasish Mitra,et al.  ERSA: Error Resilient System Architecture for probabilistic applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[11]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[12]  Yuan Taur,et al.  Device scaling limits of Si MOSFETs and their application dependencies , 2001, Proc. IEEE.

[13]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[14]  K. Bernstein,et al.  Scaling, power, and the future of CMOS , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[15]  James D. Meindl,et al.  Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration , 2002, IEEE J. Solid State Circuits.

[16]  Douglas L. Jones,et al.  Scalable stochastic processors , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[17]  Kaustav Banerjee,et al.  Analysis of IR-drop scaling with implications for deep submicron P/G network designs , 2003, Fourth International Symposium on Quality Electronic Design, 2003. Proceedings..

[18]  Naresh R. Shanbhag,et al.  Sequential Element Design With Built-In Soft Error Resilience , 2006, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[19]  R. Stanley Williams,et al.  Molecular dynamics simulations of oxide memristors: Crystal field effects , 2011, 1105.4445.

[20]  Shekhar Y. Borkar,et al.  Design perspectives on 22nm CMOS and beyond , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[21]  David Blaauw,et al.  Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, MICRO.

[22]  Hagbae Kim,et al.  A Time Redundancy Approach to TMR Failures Using Fault-State Likelihoods , 1994, IEEE Trans. Computers.

[23]  Michael Nicolaidis Time redundancy based soft-error tolerance to rescue nanometer technologies , 1999, Proceedings 17th IEEE VLSI Test Symposium (Cat. No.PR00146).

[24]  Robert C. Aitken,et al.  TIMBER: Time borrowing and error relaying for online timing error resilience , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[25]  W. Lu,et al.  Programmable Resistance Switching in Nanoscale Two-terminal Devices , 2008 .

[26]  A. S. Sadek,et al.  Fault-tolerant techniques for nanocomputers , 2002 .

[27]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..