N-Level Diskless Checkpointing

Diskless checkpointing is an efficient technique for tolerating a small number of processor failures in large parallel and distributed systems. In the literature, simultaneous failures of up to N processors are often tolerated with a one-level Reed-Solomon checkpointing scheme designed for N simultaneous failures, whose overhead often increases quickly as N increases. In this paper, we study an N-level diskless checkpointing scheme that reduces the overhead of tolerating up to N simultaneous processor failures by layering the schemes for i simultaneous failures, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.
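To make the lowest layer concrete, the following is a minimal sketch of a scheme for a single failure (i = 1), assuming a simple XOR parity encoding of in-memory checkpoints. The function names and sample data are hypothetical and chosen for illustration; this is not the paper's implementation, and the higher levels (i > 1) would need a stronger code such as Reed-Solomon.

```python
from functools import reduce


def xor_parity(checkpoints):
    """XOR equal-length byte strings column by column into one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*checkpoints))


def recover_lost(survivors, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity block."""
    return xor_parity(survivors + [parity])


# Hypothetical in-memory checkpoints held by three processors (equal length assumed).
checkpoints = [b"proc0-state", b"proc1-state", b"proc2-state"]

# The parity block would be kept in the memory of a dedicated checkpoint processor.
parity = xor_parity(checkpoints)

# Simulate the loss of processor 1 and rebuild its checkpoint.
failed = 1
survivors = [c for i, c in enumerate(checkpoints) if i != failed]
recovered = recover_lost(survivors, parity)
```

A single parity block can rebuild only one lost checkpoint. That limit is why tolerating more simultaneous failures requires a more expensive encoding, and it is the cost difference a layered scheme aims to exploit.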
