Online error detection and correction of erratic bits in register files

Aggressive voltage scaling needed for low power in each new process generation causes large deviations in the threshold voltage of minimally sized devices of the 6T SRAM cell. Gate oxide scaling can cause large transient gate leakage (a trap in the gate oxide), which is known as the erratic bits phenomena. Register file protection is necessary to prevent errors from quickly spreading to different parts of the system, which may cause applications to crash or silent data corruption. This paper proposes a simple and cost-effective mechanism that increases the resiliency of the register files to erratic bits. Our mechanism detects those registers that have erratic bits, recovers from the error and quarantines the faulty register. After the quarantine period, it is able to detect whether they are fully operational with low overhead.

[1]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[2]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[3]  Wei Liu,et al.  Using Register Lifetime Predictions to Protect Register Files Against Soft Errors , 2008 .

[4]  Mahmut T. Kandemir,et al.  Increasing register file immunity to transient errors , 2005, Design, Automation and Test in Europe.

[5]  Robert E. Lyons,et al.  The Use of Triple-Modular Redundancy to Improve Computer Reliability , 1962, IBM J. Res. Dev..

[6]  Robert Baumann,et al.  Soft errors in advanced computer systems , 2005, IEEE Design & Test of Computers.

[7]  Kaushik Roy,et al.  A 160 mV, fully differential, robust schmitt trigger based sub-threshold SRAM , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[8]  Eric Rotenberg,et al.  Slipstream processors: improving both performance and fault tolerance , 2000, SIGP.

[9]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[10]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[11]  J. Jopling,et al.  Erratic fluctuations of sram cache vmin at the 90nm process technology node , 2005, IEEE InternationalElectron Devices Meeting, 2005. IEDM Technical Digest..

[12]  Israel Koren,et al.  Defect tolerance in VLSI circuits: techniques and yield analysis , 1998, Proc. IEEE.

[13]  Francky Catthoor,et al.  Guest Editors' Intoduction: The New World of Large Embedded Memories , 2001, IEEE Des. Test Comput..

[14]  David I. August,et al.  Design and evaluation of hybrid fault-detection systems , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[15]  Diana Marculescu,et al.  Power efficiency of voltage scaling in multiple clock, multiple voltage cores , 2002, ICCAD 2002.

[16]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[17]  Sandhya Dwarkadas,et al.  Dynamic frequency and voltage control for a multiple clock domain microarchitecture , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[18]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[19]  E.S. Fetzer,et al.  The Parity protected, multithreaded register files on the 90-nm itanium microprocessor , 2006, IEEE Journal of Solid-State Circuits.

[20]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[21]  Rochit Rajsuman Design and Test of Large Embedded Memories: An Overview , 2001, IEEE Des. Test Comput..

[22]  Joel Emer,et al.  A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[23]  G. Atwood,et al.  Erratic Erase In ETOX/sup TM/ Flash Memory Array , 1993, Symposium 1993 on VLSI Technology.