Paper information - Using two-level stable storage for efficient checkpointing

Using two-level stable storage for efficient checkpointing

Checkpointing and rollback recovery is a very effective technique for tolerating failures. Checkpoint data is usually saved on disk; however, in some situations the time needed to write the data to disk represents a considerable performance overhead. Alternative solutions use main memory to hold the checkpoint data. The paper starts by presenting two main-memory checkpointing schemes: neighbour-based and parity checkpointing. Both schemes have been implemented and evaluated on a commercial parallel machine. The results show that neighbour-based checkpointing has a very low performance overhead and ensures fast recovery from partial failures. However, it cannot tolerate multiple or total failures of the system. To overcome this shortcoming, the authors propose a two-level stable storage that integrates neighbour-based with disk-based checkpointing. This approach combines the advantages of the two schemes: the efficiency of diskless checkpointing and the high reliability of disk-based checkpointing.

Luís Moura Silva | João Gabriel Silva
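The two-level idea summarised in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the ring topology, the class name `TwoLevelStore`, the `disk_interval` parameter, and the helper `demo` are all assumptions made for illustration, and "disk" is simulated with an in-memory dictionary. Each node keeps its latest checkpoint locally and in the memory of its right-hand neighbour (level 1), and every `disk_interval`-th checkpoint round is also written to disk (level 2). Recovery uses memory when every failed node's copy survives, and otherwise rolls all nodes back to the older disk checkpoint.

```python
from typing import Dict, List, Optional, Set, Tuple


class TwoLevelStore:
    """Toy model (hypothetical): level 1 = neighbour memory, level 2 = simulated disk."""

    def __init__(self, num_nodes: int, disk_interval: int = 4) -> None:
        self.n = num_nodes
        self.disk_interval = disk_interval  # assumed policy: every k-th round goes to disk
        self.local: List[Optional[bytes]] = [None] * num_nodes
        # backup[i] is the copy that node i holds for its left neighbour (i - 1) % n
        self.backup: List[Optional[bytes]] = [None] * num_nodes
        self.disk: Dict[int, bytes] = {}
        self.rounds = 0

    def checkpoint(self, states: List[bytes]) -> None:
        """Take one coordinated checkpoint round for all nodes."""
        assert len(states) == self.n
        self.rounds += 1
        for i, state in enumerate(states):
            self.local[i] = state
            self.backup[(i + 1) % self.n] = state  # copy into the right neighbour's memory
        if self.rounds % self.disk_interval == 0:
            self.disk = dict(enumerate(states))

    def recover(self, failed: Set[int]) -> Tuple[str, List[bytes]]:
        """Return the level used and the state each node restarts from."""
        for i in failed:  # a failed node loses everything held in its memory
            self.local[i] = None
            self.backup[i] = None
        restored = [
            self.backup[(i + 1) % self.n] if i in failed else self.local[i]
            for i in range(self.n)
        ]
        if all(copy is not None for copy in restored):
            return "memory", restored
        if self.disk:  # fall back: every node rolls back to the disk checkpoint
            return "disk", [self.disk[i] for i in range(self.n)]
        raise RuntimeError("no usable checkpoint at either level")


def demo(failed: Set[int]) -> Tuple[str, List[bytes]]:
    """Three rounds on four nodes; only round 2 reaches disk (disk_interval=2)."""
    store = TwoLevelStore(num_nodes=4, disk_interval=2)
    for r in (1, 2, 3):
        store.checkpoint([f"n{i}-r{r}".encode() for i in range(4)])
    return store.recover(failed)
```

In this toy model, `demo({1})` recovers from memory at round 3, while `demo({1, 2})` and a total failure `demo({0, 1, 2, 3})` fall back to the round 2 disk checkpoint. The trade-off it shows is the one the abstract describes: the memory level is fast but loses data when a node and the holder of its copy fail together, and the disk level covers that case at the cost of a longer rollback.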
