Resolving error propagation in distributed systems

This paper investigates error propagation in distributed systems. To address it, a state preservation scheme is presented that saves process states in main memory. Using the preserved states, processes affected by error propagation can be recovered without accessing stable storage, which significantly reduces the recovery overhead. In addition, a well-known single-source, all-destination graph algorithm is used to find the optimal recovery points of the affected processes.
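
The abstract does not name the single-source, all-destination algorithm or say how the graph is built. As a rough illustration only, the sketch below assumes the algorithm is a shortest-path computation (Dijkstra's algorithm is used here) over a hypothetical graph whose nodes stand for saved process states and whose edge weights stand for rollback costs. The function name, node labels, and weights are all assumptions made for this example and are not taken from the paper.

```python
import heapq


def single_source_shortest_paths(graph, source):
    """Return the cheapest cost from `source` to every reachable node.

    `graph` maps a node to a list of (neighbor, weight) pairs with
    non-negative weights. This is plain Dijkstra; mapping nodes to saved
    process states is an assumption made for illustration.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a cheaper path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical example: nodes are saved states, weights are rollback costs.
example_graph = {
    "P1.s2": [("P2.s1", 1), ("P3.s1", 4)],
    "P2.s1": [("P3.s1", 2)],
    "P3.s1": [],
}
costs = single_source_shortest_paths(example_graph, "P1.s2")
```

One run from the source node yields a cost for every other reachable node, which is the property that makes a single-source, all-destination algorithm a natural fit for selecting recovery points for several affected processes at once.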
