Consistent replicated transactions: a highly reliable program execution environment

A highly reliable program execution environment that enables user programs to tolerate underlying hardware failures is presented. The approach is to run multiple copies of a user program at the same time. As long as one copy survives, the user program completes successfully. Meanwhile, the user interacts with the replicated program as if it were a single, ordinary program. The authors call this characteristic user-transparent replication. To achieve user-transparent replication, program replicas must behave consistently; otherwise, users might receive different queries or different output from different running replicas. The authors identify the reasons why inconsistent program execution occurs and propose algorithms to ensure that computation replicas behave consistently. With consistently running program replicas, a filter program can easily be constructed to delete duplicated I/O requests or duplicated output and thus achieve user transparency.
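The filtering idea can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes that consistent replicas emit the same outputs in the same order, and that each output carries a per-replica sequence number. The names `OutputFilter`, `submit`, and `run_demo` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OutputFilter:
    """Forwards each output to the user once, however many replicas emit it.

    Assumption (not from the paper): every replica numbers its outputs
    0, 1, 2, ... and consistent replicas produce identical payloads for
    the same sequence number.
    """
    next_seq: int = 0  # sequence number of the next output the user should see
    delivered: List[str] = field(default_factory=list)

    def submit(self, replica_id: int, seq: int, payload: str) -> bool:
        """Return True if this payload was forwarded, False if it was a duplicate."""
        if seq < self.next_seq:
            # Another replica already produced this output; drop the copy.
            return False
        if seq > self.next_seq:
            # Cannot happen if every replica emits its outputs in order.
            raise ValueError(f"replica {replica_id} skipped ahead to output {seq}")
        self.delivered.append(payload)
        self.next_seq += 1
        return True


def run_demo() -> List[str]:
    """Three replicas run the same program; replica 2 fails after its first output."""
    output_filter = OutputFilter()
    events = [
        (0, 0, "Enter amount:"),
        (1, 0, "Enter amount:"),
        (2, 0, "Enter amount:"),  # replica 2 crashes after this point
        (1, 1, "Balance: 40"),
        (0, 1, "Balance: 40"),
    ]
    for replica_id, seq, payload in events:
        output_filter.submit(replica_id, seq, payload)
    return output_filter.delivered


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the user sees each prompt and result exactly once, and the loss of one replica is invisible as long as at least one replica keeps producing output. The sketch relies entirely on replica consistency: if replicas diverged, the first copy to arrive would win and the differing copies would be silently dropped.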
