High-Level Fault Tolerance in Distributed

We have been developing high-level checkpoint and restart methods for Dome (Distributed Object Migration Environment), a C++ library of data-parallel objects that are automatically distributed using PVM. There are several levels of programming abstraction at which fault tolerance mechanisms can be designed: high-level, where the checkpoint and restart are built into our C++ objects, but the program structure is severly constrained; high-level with preprocessing, where a preprocessor inserts extra C++ statements into the code to facilitate checkpoint and restart; and low-level, where periodically an interrupt causes a memory image to be written out. Because we consider porta-bility (both of our libraries and of the checkpoints they produce) to be an important goal, we focus on the higher-level checkpointing methods. In addition, we describe an implementation of high-level checkpointing, demonstrate it on multiple architectures, and show that it is eecient enough to provide good expected run times with low overhead, even in the case of frequent failures.

[1]  Kai Li,et al.  ickp: a consistent checkpointer for multicomputers , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.

[2]  M. Moura Silva,et al.  Checkpointing SPMD applications on transputer networks , 1994, Proceedings of IEEE Scalable High Performance Computing Conference.

[3]  Erik Seligman,et al.  Dome: Distributed Object Migration Environment , 1994 .

[4]  James M. Purtilo,et al.  Dynamic reconfiguration in distributed systems: adapting software modules for replacement , 1993, [1993] Proceedings. The 13th International Conference on Distributed Computing Systems.

[5]  Peter Steenkiste,et al.  Fail-Safe PVM: A Portable Package for Distributed Programming with Transparent Recovery , 1993 .

[6]  Jack Dongarra,et al.  Pvm 3 user's guide and reference manual , 1993 .

[7]  Luís Moura Silva,et al.  Global checkpointing for distributed programs , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[8]  Willy Zwaenepoel,et al.  The performance of consistent checkpointing , 1992, [1992] Proceedings 11th Symposium on Reliable Distributed Systems.

[9]  Jeffrey F. Naughton,et al.  Checkpointing multicomputer applications , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[10]  Darrell D. E. Long,et al.  A study of the reliability of Internet sites , 1991, [1991] Proceedings Tenth Symposium on Reliable Distributed Systems.

[11]  Jeffrey F. Naughton,et al.  Real-time, concurrent checkpoint for parallel programs , 1990, PPOPP '90.

[12]  G. C. Fox,et al.  What have we learnt from using real parallel machines to solve real problems? , 1989, C3P.

[13]  Andrzej Duda,et al.  The Effects of Checkpointing on Program Execution Time , 1983, Inf. Process. Lett..

[14]  Erol Gelenbe,et al.  On the Optimum Checkpoint Interval , 1979, JACM.

[15]  John W. Young,et al.  A first order approximation to the optimum checkpoint interval , 1974, CACM.