Software-Based Fault Tolerant Computing

This paper describes how to design a software-based fault tolerant application using microprocessor (MP), in order to tolerate the burst errors in memory. This approach may be called a single -- version scheme (SVS). The SVS relies on a single version application program which is enhanced with self-checking code redundancy to tolerate memory burst errors that are difficult to correct during the run-time of an application. Conventionally, the other software based approaches can detect a few bit errors (in memory) only towards fail-stop kind of fault tolerance against transient bit errors. Reed Solomon codes are mainly effective for burst errors in coding of audio Compact Disks at offline only. The proposed online technique does not need multiple versions of software and multiple machines. This approach employs only two copies of the application software running on one machine only. Two copies of the enhanced version version of an application are used here for online error detection and tolerance thereof as well. This is an effective low-cost online tool for hardening a microprocessor-based industrial computing system or for on-chip DRAM applications using an affordable code and time redundancy against the burst errors in processor memory. The SVS aims to provide a non-fail-stop kind of fault tolerance against burst errors. This approach supplements the Error Correcting Codes (ECC) in memory system also, against both the transient and permanent bit errors in memory.

[1]  Goutam Kumar Saha Software implemented fault tolerance through data error recovery , 2005, UBIQ.

[2]  Yervant Zorian,et al.  Guest Editors' Introduction: Design for Yield and Reliability , 2004, IEEE Des. Test Comput..

[3]  Giovanni De Micheli,et al.  Synthesis and simulation of digital systems containing interacting hardware and software components , 1992, [1992] Proceedings 29th ACM/IEEE Design Automation Conference.

[4]  A. Benso,et al.  An integrated HW and SW fault injection environment for real-time systems , 1998, Proceedings 1998 IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (Cat. No.98EX223).

[5]  Henrique Madeira,et al.  Experimental evaluation of the fail-silent behaviour in programs with consistency checks , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[6]  Giovanni De Micheli,et al.  Program implementation schemes for hardware-software systems , 1994, Computer.

[7]  Jacob A. Abraham,et al.  Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.

[8]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[9]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[10]  Pradip Bose,et al.  Performance Analysis and Its Impact on Design , 1998, Computer.

[11]  S. Wicker Error Control Systems for Digital Communication and Storage , 1994 .

[12]  Goutam K. Saha Transient fault tolerant processing in a RF application , 2000 .

[13]  Brian Randell Design Fault Tolerance , 1986 .

[14]  Stephen S. Yau,et al.  An Approach to Concurrent Control Flow Checking , 1980, IEEE Transactions on Software Engineering.

[15]  K. H. Kim,et al.  Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications , 1989, IEEE Trans. Computers.