Paper information - Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations


In this paper we present an approach to reliable distributed computing that incorporates fault tolerance into applications at low cost, in terms of both run-time performance and the programming effort required to construct reliable application software. In our model, fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented to support fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface, including message-passing and file I/O primitives, that hides from the application level the complexity of both fault-tolerant network communication and the checkpointing and recovery of user files. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.
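As a rough illustration of what checkpointing and recovering a user file can involve, the following Python sketch handles the simplest case: a file that is only appended to between checkpoints. It is not Libra's algorithm or interface, which the abstract does not describe. The class `AppendOnlyFileCheckpoint`, its methods, and the `demo` function are hypothetical names introduced here, and the append-only restriction is an assumption made to keep the example short.

```python
import os
import tempfile


class AppendOnlyFileCheckpoint:
    """Toy checkpoint/rollback for an append-only file (hypothetical, not Libra's API)."""

    def __init__(self, path):
        self.path = path
        self.saved_length = None  # file length recorded at the last checkpoint

    def checkpoint(self):
        # Assumption: the file is only appended to, so its length at
        # checkpoint time fully identifies the checkpointed contents.
        self.saved_length = os.path.getsize(self.path)

    def rollback(self):
        # Discard everything written after the checkpoint by truncating
        # the file back to the recorded length.
        if self.saved_length is None:
            raise RuntimeError("no checkpoint has been taken")
        with open(self.path, "r+b") as f:
            f.truncate(self.saved_length)


def demo():
    """Write, checkpoint, write again, then roll back; return the final contents."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "output.log")
        with open(path, "wb") as f:
            f.write(b"step 1\n")

        ckpt = AppendOnlyFileCheckpoint(path)
        ckpt.checkpoint()

        # Work done after the checkpoint; a failure would invalidate it.
        with open(path, "ab") as f:
            f.write(b"step 2\n")

        ckpt.rollback()
        with open(path, "rb") as f:
            return f.read()
```

Recording only a length keeps the checkpoint cheap because no file data is copied. A file that is overwritten in place would need more than this, for example saving the old contents of modified regions, which the sketch does not attempt.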

Piyush Maheshwari | Jinsong Ouyang
