Replica determinism in distributed real-time systems: A brief survey

Replicating entities is a convenient technique for achieving fault tolerance. The problem of replica determinism is to ensure that replicated entities behave consistently in the absence of failures. This survey presents possible sources of replica non-determinism, along with the basic requirements and strategies for enforcing replica determinism. Enforcing replica determinism under real-time constraints is surveyed in the context of the communication problem for distributed systems. The close interdependence between replica determinism on the one hand and synchronization strategies, failure handling, and redundancy preservation on the other is also reviewed, as is the impact of synchronous and asynchronous approaches on replication strategies.
