Fault-tolerant atomic computations in an object-based distributed system

A distributed system can support fault-tolerant applications by replicating data and computation at nodes that have independent failure modes. We present a scheme called parallel execution threads (PET) that can be used to implement fault-tolerant computations in an object-based distributed system. In a system that replicates objects, the PET scheme replicates a computation by creating a number of parallel threads, each of which executes with different replicas of the invoked objects. A computation completes successfully if at least one thread does not encounter any failed nodes and its completion preserves the consistency of the objects. The PET scheme can therefore tolerate failures that occur during the execution of the computation, as long as at least one thread is unaffected by them. We present the algorithms required to implement the PET scheme and also address some performance issues.
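The core idea of the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the names `Replica`, `NodeFailure`, and `run_pet` are hypothetical, node failure is simulated with a flag, and the consistency check performed at completion is omitted entirely. The sketch only shows one thread per replica, with the computation succeeding if at least one thread avoids every failed node.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class NodeFailure(Exception):
    """Raised when a thread touches a replica on a failed node (simulated)."""


class Replica:
    """Hypothetical object replica stored on a single node."""

    def __init__(self, node, value, failed=False):
        self.node = node
        self.value = value
        self.failed = failed

    def invoke(self, delta):
        # An invocation on a failed node aborts the calling thread.
        if self.failed:
            raise NodeFailure(self.node)
        return self.value + delta


def run_pet(replicas, delta):
    """Run one thread per replica and return (node, result) of the first
    thread that completes without encountering a failed node."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = {pool.submit(r.invoke, delta): r for r in replicas}
        for fut in as_completed(futures):
            try:
                result = fut.result()
            except NodeFailure:
                continue  # this thread was affected by a failure; try another
            return futures[fut].node, result
    # Every thread hit a failed node, so the computation cannot complete.
    raise RuntimeError("all threads encountered failed nodes")


if __name__ == "__main__":
    replicas = [
        Replica("n1", 10, failed=True),
        Replica("n2", 10),
        Replica("n3", 10, failed=True),
    ]
    node, result = run_pet(replicas, 5)
    print(node, result)
```

In this sketch the surviving thread's result is simply returned. A real implementation would also have to propagate the result to the other replicas in a way that keeps them consistent, which is the part the paper's algorithms address.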
