Towards a Theory of Replicated Processing

In the N-Modular Redundancy (NMR) approach, a computation is made reliable by executing it on several computers and determining its results by a decision algorithm. This paper investigates a formal approach to the use of NMR in replicated distributed systems, for which it introduces a notion of correctness based on consistency with their non-replicated counterparts, together with a local correctness criterion. We discuss how a replicated system component may be implemented by N base copies, a majority of which are non-faulty. The formal approach sheds light on the necessity of coordinating the copies and on the requirements they should satisfy; in particular, it points out the difficulty of replicating synchronous communication. A practical approach is also briefly examined and shown to be consistent with the formal model. Inside every replicated system there is a non-replicated system trying to get out.
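As a rough illustration of the NMR idea described above, the following sketch runs a computation on N copies and applies a strict-majority vote as the decision algorithm. It is not the paper's formal model: the names `majority_vote`, `run_nmr`, `square`, and `faulty_square` are hypothetical, the copies are plain Python functions rather than distributed processes, and coordination between copies is ignored entirely.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence, TypeVar

T = TypeVar("T", bound=Hashable)


def majority_vote(results: Sequence[T]) -> Optional[T]:
    """Return the value reported by a strict majority of copies, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority means more than half of all copies agree.
    return value if 2 * count > len(results) else None


def run_nmr(copies: Sequence[Callable[[int], T]], x: int) -> Optional[T]:
    """Execute the same computation on every copy and vote on the results."""
    results = [copy(x) for copy in copies]
    return majority_vote(results)


def square(x: int) -> int:
    """A non-faulty base copy (hypothetical example computation)."""
    return x * x


def faulty_square(x: int) -> int:
    """A faulty base copy that returns a wrong result (assumed fault model)."""
    return x * x + 1


if __name__ == "__main__":
    # Three copies, one faulty: the majority masks the fault.
    print(run_nmr([square, square, faulty_square], 4))
```

With three copies of which one is faulty, the vote returns the result of the two agreeing copies; when no strict majority exists, the sketch returns `None` rather than guessing.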
