Paper information - An efficient coordinated checkpointing scheme for multicomputers

An efficient coordinated checkpointing scheme for multicomputers

A new approach for checkpointing multicomputer applications is presented. Checkpointing is initiated and controlled by a checkpoint coordinator, which resides either on one of the nodes running the application or on the host processor attached to the multicomputer. A message count is used to determine whether any messages are in transit. The proposed strategy is hardware-independent and can be implemented in any multicomputer system, irrespective of its architecture, interconnection, and routing strategy. The scheme can be used with FIFO and non-FIFO channels, as well as with channels where messages can be lost. Results obtained from our simulations indicate that the proposed strategy outperforms an existing scheme proposed for fixed-path wormhole-routed multicomputer systems. Although the proposed strategy is targeted at high-performance, massively parallel multicomputers, it can also be used in any general-purpose distributed system to reduce checkpointing overhead.
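The abstract does not spell out the protocol, so the following is only a minimal sketch of the message-count idea, not the authors' scheme. It assumes each node reports how many application messages it has sent and received, and that the coordinator treats the difference between the two totals as the number of messages still in transit. The names `NodeReport`, `messages_in_transit`, and `safe_to_checkpoint` are hypothetical, and the sketch assumes reliable channels (handling lost messages, which the paper covers, is not modeled here).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeReport:
    """Hypothetical per-node report sent to the checkpoint coordinator."""
    node_id: int
    sent: int      # application messages this node has sent so far
    received: int  # application messages this node has received so far


def messages_in_transit(reports: Iterable[NodeReport]) -> int:
    """Return total sent minus total received across all reporting nodes.

    Assumption: every message is sent by one reporting node and destined
    for another, and no message is lost.
    """
    reports = list(reports)
    total_sent = sum(r.sent for r in reports)
    total_received = sum(r.received for r in reports)
    return total_sent - total_received


def safe_to_checkpoint(reports: Iterable[NodeReport]) -> bool:
    """The coordinator may commit only when no message is in flight."""
    return messages_in_transit(reports) == 0


# Usage: node 0 has sent 3 messages, but node 1 has received only 2 so far.
pending = [NodeReport(0, sent=3, received=2), NodeReport(1, sent=2, received=2)]
drained = [NodeReport(0, sent=3, received=2), NodeReport(1, sent=2, received=3)]
print(safe_to_checkpoint(pending))  # one message still in transit
print(safe_to_checkpoint(drained))  # counts balance
```

Because the check compares only totals, it does not depend on the order in which messages arrive, which is one plausible reason a count-based test can work on non-FIFO channels.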

Dhiraj K. Pradhan | Debendra Das Sharma

[1] Ten-Hwang Lai, et al. On Distributed Snapshots, 1987, Inf. Process. Lett.

[2] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[3] Richard Koo, et al. Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.

[4] Ten-Hwang Lai, et al. Termination Detection for Dynamically Distributed Systems with Non-first-in-first-out Communication, 1986, J. Parallel Distributed Comput.

[5] Jeffrey F. Naughton, et al. Checkpointing multicomputer applications, 1991, Proceedings Tenth Symposium on Reliable Distributed Systems.

[6] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.