Stable checkpointing in distributed systems without shared disks

Interacting processes in distributed systems save their checkpoints on local disks for efficiency reasons. However, because local checkpoints become unavailable when their hosts fail, redundancy schemes similar to RAID have to be used. In such systems, checkpoints are stable under a particular fault model because they can be reconstructed from data held elsewhere in the distributed system. In this paper, two variants of stable checkpoint storage are compared: (a) parity grouping over local checkpoints and (b) RAID-like distribution of each checkpoint using a software-based distributed storage system. An analysis compares the costs of collective checkpoint creation, recovery of a single process, and rollback of all processes. The results show that, despite differences in detail, checkpointing using a distributed storage system is a reasonable solution.
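The two storage variants can be illustrated with a minimal sketch. It assumes single XOR parity, equally sized checkpoints within a parity group, and at most one failed host; the abstract does not specify the paper's actual encoding, so these are simplifying assumptions. The function names (`xor_blocks`, `group_parity`, `stripe_with_parity`, `rebuild_missing`) are hypothetical and chosen for illustration only.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def group_parity(local_checkpoints):
    """Variant (a): one parity block over the local checkpoints of a group.

    Assumes all checkpoints in the group have the same length.
    """
    return xor_blocks(local_checkpoints)


def stripe_with_parity(checkpoint, n_data):
    """Variant (b): split one checkpoint into n_data blocks plus one parity block.

    Each returned block would be placed on a different host.
    The checkpoint is zero-padded to a multiple of n_data.
    """
    block_len = -(-len(checkpoint) // n_data)  # ceiling division
    padded = checkpoint.ljust(block_len * n_data, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(n_data)]
    return blocks + [xor_blocks(blocks)]


def rebuild_missing(members):
    """Recover the single missing member (marked None) of a parity set."""
    return xor_blocks([m for m in members if m is not None])


# Variant (a): the host holding the checkpoint of process 1 fails.
ckpts = [b"state-of-p0", b"state-of-p1", b"state-of-p2"]
parity = group_parity(ckpts)
assert rebuild_missing([ckpts[0], None, ckpts[2], parity]) == ckpts[1]

# Variant (b): one host holding a stripe block of a checkpoint fails.
stripes = stripe_with_parity(b"state-of-p0", 3)
assert rebuild_missing([stripes[0], stripes[1], None, stripes[3]]) == stripes[2]
```

In variant (a), each process still reads and writes its own checkpoint locally and only the parity travels over the network. In variant (b), every checkpoint is spread across hosts, so both creation and recovery involve remote blocks. This is the kind of cost difference the analysis compares.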
