A timeout-based message ordering protocol for a lightweight software implementation of TMR systems

Replicated processing with majority voting is a well-known method for achieving reliability and availability. Triple modular redundant (TMR) processing is the most commonly used form of that method. Replicated processing requires that the replicas agree on the order in which input requests are processed. Almost all synchronous, deterministic ordering protocols published in the literature are time-based, in the sense that they require the replicas' clocks to be kept synchronized within some known bound. We present a protocol for TMR systems that is based on timeouts and does not require clocks to be kept in bounded synchronism. Our design effort focuses on keeping ordering delays small without unnecessarily increasing message overhead. As a result, we are able to show that no symmetric protocol that works only with unsynchronized clocks can provide a smaller worst-case delay. We also demonstrate, through analysis and experiments, that our protocol is faster than a time-based protocol of identical message complexity in certain situations that can prevail in many application settings.
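As a small illustration of the majority-voting step that TMR processing relies on, the following Python sketch votes on the outputs of three replicas. It is not the ordering protocol presented here; the function name `tmr_vote` and the convention of returning `None` when no two replicas agree are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def tmr_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by at least two of three replicas.

    Hypothetical helper: returns None when all three outputs differ,
    i.e. when no majority exists.
    """
    if len(outputs) != 3:
        raise ValueError("TMR voting expects exactly three replica outputs")
    # most_common(1) yields the single most frequent value and its count.
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= 2 else None
```

Voting of this kind masks a single faulty replica only if all correct replicas process the same requests in the same order, which is why an agreed input order is needed in the first place.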
