Asynchronous recovery protocols for distributed systems

The authors address the problem of error recovery in a system of distributed communicating processes. They show that if each process can detect its local computation errors when it establishes a recovery point, then the number of dependencies between processes can be reduced by exploiting the temporal ordering of the messages they exchange. The proposed approach lets processes proceed independently during normal computation, and it can be extended to support independent rollback without explicit coordination. The authors also discuss how to handle messages that originate from, or are received by, tasks that later abort. Simulation studies indicate that the approach achieves much higher throughput than the synchronous approach.
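The dependency reduction described above can be illustrated with a minimal sketch. This is not the authors' protocol. It assumes that every recovery point is taken only after the process has passed a local error check, so a message sent before the sender's latest recovery point comes from validated state and creates no rollback dependency. The names `Process`, `send`, and `rollback_set`, and the use of per-process logical clocks, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    name: str
    clock: int = 0  # local logical time (assumption: one tick per event)
    recovery_points: list = field(default_factory=lambda: [0])
    received: list = field(default_factory=list)  # (sender, send_time, recv_time)

    def tick(self):
        self.clock += 1
        return self.clock

    def establish_recovery_point(self):
        # Assumption: local error detection has passed at this point,
        # so everything sent earlier is treated as validated.
        self.recovery_points.append(self.tick())


def send(sender, receiver):
    """Record one message, timestamped on each side's own clock."""
    send_time = sender.tick()
    recv_time = receiver.tick()
    receiver.received.append((sender.name, send_time, recv_time))


def rollback_set(procs, failed):
    """Return {name: restore_time} for every process that must roll back."""
    restore = {failed: procs[failed].recovery_points[-1]}
    changed = True
    while changed:
        changed = False
        for p in procs.values():
            for sender, send_time, recv_time in p.received:
                # Only messages sent after the sender's restore point are undone.
                if sender in restore and send_time > restore[sender]:
                    # Latest recovery point of p taken before the receipt.
                    target = max(rp for rp in p.recovery_points if rp < recv_time)
                    if restore.get(p.name, float("inf")) > target:
                        restore[p.name] = target
                        changed = True
    return restore


a, b = Process("A"), Process("B")
procs = {"A": a, "B": b}
send(a, b)                     # m1: sent before A's recovery point
a.establish_recovery_point()
b.establish_recovery_point()
only_m1 = rollback_set(procs, "A")
send(a, b)                     # m2: sent after A's recovery point
with_m2 = rollback_set(procs, "A")
print(only_m1, with_m2)
```

In this sketch, a failure of A after only `m1` has been sent forces A alone to roll back, because `m1` predates A's validated recovery point. Once `m2` is sent, B must also return to its latest recovery point taken before receiving `m2`. Replaying messages that a rolled-back process had already received, and the handling of messages tied to aborted tasks, are left out.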
