TERPS: the embedded reliable processing system

TERPS is a fault-tolerant computer design that significantly reduces the threat of electromagnetic interference (EMI) by using hardware checkpoint/rollback recovery. TERPS tolerates EMI by periodically checkpointing processor state into a special safe-storage device. When EMI is detected, rollback is invoked: the processor state is restored from a previously checkpointed state and normal execution resumes. Rollback costs performance in proportion to the EMI duration; TERPS ensures forward progress of the system provided EMI events are separated by some minimum time interval (e.g., at least 5.12 µs for our prototype processor running at 100 MHz). The performance overhead of our mechanism is reasonable: 5-6% when checkpointing every 128 processor cycles.
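To make the figures above concrete, the following is a minimal sketch in Python, not the TERPS hardware design. It converts the quoted minimum EMI separation into clock cycles and models checkpoint/rollback on a toy processor whose entire state is a single counter. The names `cycles_in` and `ToyProcessor` are hypothetical, and the assumption that a rollback returns to the most recent checkpoint is an illustration only; the abstract does not say which checkpoint is restored or how the 5.12 µs bound is derived.

```python
from dataclasses import dataclass

CLOCK_HZ = 100e6                  # prototype clock from the abstract
MIN_EMI_SEPARATION_S = 5.12e-6    # minimum EMI separation from the abstract
CHECKPOINT_PERIOD_CYCLES = 128    # checkpoint period from the abstract


def cycles_in(interval_s: float, clock_hz: float) -> int:
    """Convert a time interval to a whole number of clock cycles."""
    return round(interval_s * clock_hz)


@dataclass
class ToyProcessor:
    """Hypothetical model: the whole processor state is one counter."""
    checkpoint_period: int = CHECKPOINT_PERIOD_CYCLES
    cycle: int = 0
    state: int = 0
    saved_cycle: int = 0   # stands in for the safe-storage device
    saved_state: int = 0

    def step(self) -> None:
        """Execute one cycle and checkpoint at every period boundary."""
        self.state += 1
        self.cycle += 1
        if self.cycle % self.checkpoint_period == 0:
            self.saved_cycle, self.saved_state = self.cycle, self.state

    def rollback(self) -> int:
        """Restore the last checkpoint; return the cycles of work discarded."""
        lost = self.cycle - self.saved_cycle
        self.cycle, self.state = self.saved_cycle, self.saved_state
        return lost


separation_cycles = cycles_in(MIN_EMI_SEPARATION_S, CLOCK_HZ)
periods_per_separation = separation_cycles // CHECKPOINT_PERIOD_CYCLES

cpu = ToyProcessor()
for _ in range(300):          # run 300 cycles, then pretend EMI is detected
    cpu.step()
lost_cycles = cpu.rollback()  # work since the checkpoint at cycle 256
```

Under these assumptions, 5.12 µs at 100 MHz is 512 cycles, which spans four 128-cycle checkpoint periods, and an EMI event detected at cycle 300 discards the 44 cycles executed since the checkpoint at cycle 256.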
