论文信息 - High-Level Fault Tolerance in Distributed Programs

High-Level Fault Tolerance in Distributed Programs

We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severly constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where periodically an interrupt causes a memory image to be written out. Because we consider portability (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is efficient enough to provide good expected run times with low overhead, even in the case of frequent failures.

Erik Seligman | Adam Beguelin

[1] G. C. Fox,et al. What have we learnt from using real parallel machines to solve real problems? , 1989, C3P.

[2] Jeffrey F. Naughton,et al. Checkpointing multicomputer applications , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[3] James M. Purtilo,et al. Dynamic reconfiguration in distributed systems: adapting software modules for replacement , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[4] Sheldon M. Ross,et al. A First Course in Probability , 1979 .

[5] M. Moura Silva,et al. Checkpointing SPMD applications on transputer networks , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[6] Erol Gelenbe,et al. On the Optimum Checkpoint Interval , 1979, JACM.

[7] Jeffrey F. Naughton,et al. Real-time, concurrent checkpoint for parallel programs , 1990, PPOPP '90.

[8] Kai Li,et al. ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.

[9] Willy Zwaenepoel,et al. The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[10] Luís Moura Silva,et al. Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[11] Darrell D. E. Long,et al. A study of the reliability of Internet sites , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[12] Peter Steenkiste,et al. Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .

[13] Erik Seligman,et al. Dome: Distributed Object Migration Environment , 1994 .

[14] Andrzej Duda,et al. The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..

[15] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.