A Self-Checking Hardware Journal for a Fault-Tolerant Processor Architecture

We introduce a specialized self-checking hardware journal being used as a centerpiece in our design strategy to build a processor tolerant to transient faults. Fault tolerance here relies on the use of error detection techniques in the processor core together with journalization and rollback execution to recover from erroneous situations. Effective rollback recovery is possible thanks to using a hardware journal and chosing a stack computing architecture for the processor core instead of the usual RISC or CISC. The main objective of the journalization and the hardware self-checking journal is to prevent data not yet validated to be sent to the main memory, and allow to fast rollback execution on faulty situations. The main memory, supposed to be fault secure in our model, only contains valid (uncorrupted) data obtained from fault-free computations. Error control coding techniques are used both in the processor core to detect errors and in the HW journal to protect the temporarily stored data from possible changes induced by transient faults. Implementation results on an FPGA of the Altera Stratix-II family show clearly the relevance of the approach, both in terms of performance/area tradeoff and fault tolerance effectiveness, even for high error rates.

[1]  M. Y. Hsiao,et al.  A class of optimal minimum odd-weight-column SEC-DED codes , 1970 .

[2]  Fabrice Monteiro,et al.  A HW/SW mixed mechanism to improve the dependability of a stack processor , 2009, 2009 16th IEEE International Conference on Electronics, Circuits and Systems - (ICECS 2009).

[3]  Fred B. Schneider,et al.  Hypervisor-based fault tolerance , 1996, TOCS.

[4]  Mahmut T. Kandemir,et al.  Reliability-Aware Co-Synthesis for Embedded Systems , 2004, ASAP.

[5]  Erik G. Larsson,et al.  Fault-Tolerant Average Execution Time Optimization for System-On-Chips , 2009 .

[6]  Jie Xu,et al.  Roll-forward error recovery in embedded real-time systems , 1996, Proceedings of 1996 International Conference on Parallel and Distributed Systems.

[7]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[8]  Ying Zhang,et al.  A unified approach for fault tolerance and dynamic power management in fixed-priority real-time embedded systems , 2006, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[9]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[10]  Dennis McEvoy The architecture of Tandem's NonStop system , 1981, ACM '81.

[11]  Edward A. Lee,et al.  Heterogeneous Simulation—Mixing Discrete-Event Models with Dataflow , 1997, J. VLSI Signal Process..

[12]  James L. Walsh,et al.  IBM experiments in soft fails in computer electronics (1978-1994) , 1996, IBM J. Res. Dev..

[13]  Jr. Philip J. Koopman,et al.  Stack computers: the new wave , 1989 .

[14]  Jean Arlat,et al.  Fault Injection and Dependability Evaluation of Fault-Tolerant Systems , 1993, IEEE Trans. Computers.

[15]  Niraj K. Jha,et al.  COFTA : Hardware-Software Co-Synthesis of Heterogeneous Distributed Embedded Systems for Low Overhead Fault Tolerance , 1999 .

[16]  Shohaib Aboobacker RAZOR: circuit-level correction of timing errors for low-power operation , 2011 .

[17]  Fabrice Monteiro,et al.  A fault tolerant journalized stack processor architecture , 2009, 2009 15th IEEE International On-Line Testing Symposium.

[18]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[19]  Nagarajan Kandasamy,et al.  Transparent recovery from intermittent faults in time-triggered distributed systems , 2003 .

[20]  Dhiraj K. Pradhan,et al.  Fault-tolerant computer system design , 1996 .

[21]  Dhiraj K. Pradhan,et al.  Fault Injection: A Method for Validating Computer-System Dependability , 1995, Computer.

[22]  Tipp Moseley,et al.  Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[23]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[24]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[25]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[26]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[27]  Alan Burns,et al.  Analysis of Checkpointing for Real-Time Systems , 2004, Real-Time Systems.

[28]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[29]  Hermann Kopetz,et al.  Tolerating transient faults in MARS , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[30]  Isabelle Puaut,et al.  Scheduling fault-tolerant distributed hard real-time tasks independently of the replication strategies , 1999, Proceedings Sixth International Conference on Real-Time Computing Systems and Applications. RTCSA'99 (Cat. No.PR00306).

[31]  Rami G. Melhem,et al.  Tolerance to Multiple Transient Faults for Aperiodic Tasks in Hard Real-Time Systems , 2000, IEEE Trans. Computers.

[32]  Carl E. Landwehr,et al.  Basic concepts and taxonomy of dependable and secure computing , 2004, IEEE Transactions on Dependable and Secure Computing.

[33]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[34]  Dhiraj K. Pradhan,et al.  Virtual Checkpoints: Architecture and Performance , 1992, IEEE Trans. Computers.