Distributed Checkpointing on Clusters with Dynamic Striping and Staggering

This paper presents a new striped and staggered checkpointing (SSC) scheme for multicomputer clusters. We consider serverless clusters, in which the local disks attached to cluster nodes collectively form a distributed RAID (redundant array of inexpensive disks) with a single I/O space. The distributed RAID is used to save checkpoint files periodically. Striping enables parallel I/O across the distributed disks. Staggering avoids the network bottleneck in distributed disk I/O operations. For a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. Our SSC approach allows dynamic reconfiguration to minimize message-logging requirements among concurrent software processes. We demonstrate how to reduce the checkpointing overhead by striping and staggering dynamically. For communication-intensive programs, our SSC scheme can significantly reduce the checkpointing overhead. Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing schemes that support fast rollback recovery from any single node (disk) failure in a cluster of computers.
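The tradeoff between striping and staggering at a fixed cluster size can be illustrated with a small sketch. It is not the paper's algorithm: the function name `ssc_schedule`, the parameter `stripe_width`, and the rule that the nodes are split into equal stripe groups that write one after another are all assumptions made for illustration only.

```python
def ssc_schedule(num_nodes, stripe_width):
    """Return a hypothetical checkpoint schedule as a list of stripe groups.

    Assumption (not taken from the paper): the nodes are partitioned into
    equal groups of `stripe_width` nodes. Disks within a group are written
    in parallel (striping), and the groups take turns, one per time step
    (staggering).
    """
    if stripe_width < 1 or num_nodes % stripe_width != 0:
        raise ValueError("stripe_width must be a positive divisor of num_nodes")
    return [
        list(range(start, start + stripe_width))
        for start in range(0, num_nodes, stripe_width)
    ]


def staggering_depth(num_nodes, stripe_width):
    """Number of staggered time steps needed to cover every node once."""
    return len(ssc_schedule(num_nodes, stripe_width))
```

Under these assumptions, a wider stripe gives more parallel disk I/O per step but fewer staggered steps, and a narrower stripe gives the opposite. With 8 nodes, a stripe width of 4 yields 2 staggered steps, while a stripe width of 2 yields 4.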
