Software Implemented Fault Tolerance Technologies and Experience

By software implemented fault tolerance, we mean a set of software facilities to detect and recover from faults that are not handled by the underlying hardware or operating system. We consider those faults that cause an application process to crash or hang; they include software faults as well as faults in the underlying hardware and operating system layers that go undetected in those layers. We define four levels of software fault tolerance based on the availability and data consistency of an application in the presence of such faults. Watchd, libft and nDFS are reusable components that provide up to the third level of software fault tolerance. They perform, respectively, automatic detection and restart of failed processes, periodic checkpointing and recovery of critical volatile data, and replication and synchronization of persistent data in an application software system. These modules have been ported to a number of UNIX platforms and can be used by any application with minimal programming effort. Some newer telecommunications products in AT&T have already enhanced their fault-tolerance capability using these three components. Experience with those products to date indicates that these modules provide an efficient and economical means to increase the level of fault tolerance in a software product. The performance overhead due to these components depends on the level of fault tolerance and varies from 0.1% to 14% based on the amount of critical data being checkpointed and replicated.
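The first two mechanisms named above, restart of a failed process and checkpointing of critical volatile data, can be illustrated with a minimal sketch. It is not the watchd or libft interface, which the abstract does not describe; `supervise`, `demo`, the `WORKER` script, and the JSON checkpoint file are all hypothetical names chosen for illustration. The sketch assumes a worker that saves its state after each step and a supervisor that restarts it on a non-zero exit status. It covers crashes only; detecting hangs would need something further, such as a heartbeat.

```python
import json
import os
import subprocess
import sys
import tempfile

# Hypothetical worker: counts to 5, checkpoints after every step, and
# crashes once at step 3 to simulate a software fault.
WORKER = r"""
import json, os, sys
ckpt, crash_flag = sys.argv[1], sys.argv[2]
state = {"count": 0}
if os.path.exists(ckpt):                 # recovery: reload the critical data
    with open(ckpt) as f:
        state = json.load(f)
while state["count"] < 5:
    state["count"] += 1
    tmp = ckpt + ".tmp"
    with open(tmp, "w") as f:            # checkpoint: write, then rename atomically
        json.dump(state, f)
    os.replace(tmp, ckpt)
    if state["count"] == 3 and not os.path.exists(crash_flag):
        open(crash_flag, "w").close()    # remember that the fault already fired
        os._exit(1)                      # simulated crash
"""


def supervise(cmd, max_restarts=3):
    """Run cmd and restart it after each non-zero exit; return the restart count."""
    restarts = 0
    while True:
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError("worker kept failing")
        restarts += 1


def demo():
    """Run the worker under supervision; return (restarts, final count)."""
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "state.json")
        flag = os.path.join(workdir, "crashed")
        restarts = supervise([sys.executable, "-c", WORKER, ckpt, flag])
        with open(ckpt) as f:
            return restarts, json.load(f)["count"]
```

Calling `demo()` runs the worker until it crashes at step 3, restarts it once, and lets it resume from the checkpoint rather than from zero. Writing the checkpoint to a temporary file and renaming it keeps a crash in the middle of a write from corrupting the last good copy.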
