Enhancing the fault-tolerance of nonmasking programs

In this paper we focus on automated techniques to enhance the fault-tolerance of a nonmasking fault-tolerant program to masking. A masking program continually satisfies its specification even if faults occur. By contrast, a nonmasking program merely guarantees that, after faults stop occurring, the program recovers to states from which it continually satisfies its specification. Until the recovery is complete, however, a nonmasking program can violate its (safety) specification. Thus, the problem of enhancing fault-tolerance from nonmasking to masking requires that safety be added and recovery be preserved. We focus on this enhancement problem for high atomicity programs, where each process can read all variables, and for distributed programs, where restrictions are imposed on what processes can read and write. We present a sound and complete algorithm for high atomicity programs and a sound algorithm for distributed programs. We also argue that our algorithms are simpler than previous algorithms that add masking fault-tolerance to a fault-intolerant program. Hence, these algorithms can partially reap the benefits of automation when the cost of adding masking fault-tolerance to a fault-intolerant program is high. To illustrate these algorithms, we show how the masking fault-tolerant programs for triple modular redundancy and Byzantine agreement can be obtained by enhancing the fault-tolerance of the corresponding nonmasking versions. We also discuss how the derivation of these programs is simplified when we begin with a nonmasking fault-tolerant program.
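The requirement that "safety be added and recovery be preserved" can be made concrete on a toy state-transition model. The sketch below is only an illustration and is not the algorithm presented in the paper. It assumes a program and its faults are given as sets of (source, target) state pairs, that the safety specification is a set of bad states, and that "recovery" means only that some path back to the invariant exists (a real check would also rule out deadlocks and cycles outside the invariant). The names `reachable`, `enhance`, and all example states are hypothetical.

```python
def reachable(start, edges):
    """Return all states reachable from `start` by following `edges`."""
    seen, stack = set(start), list(start)
    while stack:
        s = stack.pop()
        for a, b in edges:
            if a == s and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def enhance(program, faults, invariant, bad):
    """Toy nonmasking-to-masking enhancement (illustrative assumption only).

    Removes program transitions that enter a bad state, then checks that
    every state reachable under faults can still reach the invariant.
    Returns the restricted transition set, or None if the check fails.
    """
    # Add safety: drop program transitions that enter a bad state.
    safe = {(a, b) for a, b in program if b not in bad}
    # States reachable from the invariant via program and fault transitions.
    span = reachable(invariant, safe | faults)
    if span & bad:
        return None  # faults alone can still reach a bad state
    # Preserve recovery (simplified): some path to the invariant must remain.
    for s in span - invariant:
        if not (reachable({s}, safe) & invariant):
            return None
    # Keep only transitions that start inside the reachable states.
    return {(a, b) for a, b in safe if a in span}


# Hypothetical example: state 0 is the invariant, state 3 violates safety.
program = {(2, 3), (3, 0), (2, 1), (1, 0)}  # nonmasking: may recover via 3
faults = {(0, 2)}
masking = enhance(program, faults, invariant={0}, bad={3})
no_recovery = enhance({(2, 3), (3, 0)}, faults, invariant={0}, bad={3})
```

In this example the nonmasking program can recover from state 2 either through the bad state 3 or through state 1; the sketch keeps only the second route. When the only recovery route passes through a bad state, as in the second call, the sketch reports failure.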
