Paper Information - N-Level Diskless Checkpointing

N-Level Diskless Checkpointing

Diskless checkpointing is an efficient technique for tolerating a small number of processor failures in large parallel and distributed systems. In the literature, simultaneous failures of up to N processors are usually tolerated with a one-level Reed-Solomon checkpointing scheme designed for N simultaneous failures, whose overhead often grows quickly as N increases. In this paper, we study an N-level diskless checkpointing scheme that reduces the overhead of tolerating up to N simultaneous processor failures by layering the schemes for i simultaneous failures, where i = 1, 2, ..., N. Simulation results indicate that the proposed N-level scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.

Zizhong Chen | Douglas Hakkarinen

[1] Jack J. Dongarra, et al. Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing, 1997, J. Parallel Distributed Comput.

[2] Luís Moura Silva, et al. An experimental study about diskless checkpointing, 1998, Proceedings. 24th EUROMICRO Conference (Cat. No.98EX204).

[3] George Bosilca, et al. Fault tolerant high performance computing by a coding approach, 2005, PPoPP.

[4] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.

[5] Hong Chen, et al. Performance Optimization of Checkpointing Schemes with Task Duplication, 2006, First International Multi-Symposiums on Computer and Computational Sciences (IMSCCS'06).

[6] Kai Li, et al. Diskless Checkpointing, 1998, IEEE Trans. Parallel Distributed Syst.

[7] Yang Jin, et al. Fault-tolerant mechanism of the distributed cluster computers, 2007.

[8] Nitin H. Vaidya. Another Two-Level Failure Recovery Scheme, 1994.

[9] Jack Dongarra, et al. Fault tolerant matrix operations for networks of workstations using multiple checkpointing, 1997, Proceedings High Performance Computing on the Information Superhighway. HPC Asia '97.

[10] Zizhong Chen, et al. Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing, 2009, IEEE Transactions on Computers.

[11] Kai Li, et al. Faster checkpointing with N+1 parity, 1994, Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing.

[12] S. Sudarshan, et al. Distributed Multi-Level Recovery in Main-Memory Databases, 1996, Fourth International Conference on Parallel and Distributed Information Systems.

[13] 尚毅梓, et al. Fault-Tolerant Technique in the Cluster Computation of the Digital Watershed Model, 2007.

[14] Luís Moura Silva, et al. Using two-level stable storage for efficient checkpointing, 1998, IEE Proc. Softw.

[15] Christian Engelmann, et al. A diskless checkpointing algorithm for super-scale architectures applied to the fast fourier transform, 2003, Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, 2003.