Paper Information - High-Level Fault Tolerance in Distributed

High-Level Fault Tolerance in Distributed

We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. Fault tolerance mechanisms can be designed at several levels of programming abstraction: high-level, where checkpoint and restart are built into our C++ objects but the program structure is severely constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where an interrupt periodically causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures.

A. Beguelin | E. Seligman
