Paper Information - A New Diskless Checkpointing Approach for Multiple Processor Failures

A New Diskless Checkpointing Approach for Multiple Processor Failures

Diskless checkpointing is an important technique for providing fault tolerance in distributed and parallel computing systems. This study proposes a new approach that extends neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called its checkpoint storage nodes. In return, each processor uses simple XOR operations to store a collection of checkpoints for the processors for which it serves as a checkpoint storage node. The study defines a safe recovery criterion, which specifies the requirement that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. It further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these conditions. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
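The XOR-based storage and single-step recovery described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the function names (`xor_all`, `encode`, `recover`), the ring-style assignment of storage nodes, and the assumption that every surviving processor still holds its own local checkpoint are all hypothetical choices made for illustration.

```python
def xor_all(blocks, size):
    """XOR together equal-length byte blocks (an empty input gives all zeros)."""
    out = bytes(size)
    for block in blocks:
        out = bytes(x ^ y for x, y in zip(out, block))
    return out


def encode(checkpoints, storage_nodes):
    """Build the XOR block held by each processor.

    checkpoints:   processor id -> checkpoint bytes (all the same length)
    storage_nodes: processor id -> set of peers that store its checkpoint
    Returns (clients, parity): clients[q] lists the processors whose
    checkpoints are folded into parity[q].
    """
    size = len(next(iter(checkpoints.values())))
    clients = {q: [] for q in checkpoints}
    for p, nodes in storage_nodes.items():
        for q in nodes:
            clients[q].append(p)
    parity = {q: xor_all((checkpoints[p] for p in ps), size)
              for q, ps in clients.items()}
    return clients, parity


def recover(failed, target, local, clients, parity):
    """Try to rebuild target's checkpoint in one step.

    A surviving node q can do this only if target is the sole failed
    processor among q's clients: XORing q's block with the surviving
    clients' local checkpoints cancels them and leaves target's checkpoint.
    Returns (q, checkpoint), or None if no such node exists.
    """
    for q, ps in clients.items():
        if q in failed or target not in ps:
            continue
        others = [p for p in ps if p != target]
        if any(p in failed for p in others):
            continue
        blocks = [parity[q]] + [local[p] for p in others]
        return q, xor_all(blocks, len(parity[q]))
    return None


# Hypothetical layout: 4 processors, each stored at its next two ring neighbors.
checkpoints = {p: bytes([p + 1] * 4) for p in range(4)}
storage_nodes = {p: {(p + 1) % 4, (p + 2) % 4} for p in range(4)}
clients, parity = encode(checkpoints, storage_nodes)

single = recover({1}, 1, checkpoints, clients, parity)
double = recover({0, 1}, 0, checkpoints, clients, parity)
```

Under this illustrative ring layout, a single failure of processor 1 is repaired by node 2 in one step. With processors 0 and 1 both failed, however, no surviving node can rebuild processor 0 directly, which appears to be the kind of situation the safe recovery criterion is meant to rule out through a careful choice of checkpoint storage node sets.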

Ge-Ming Chiu | Jane-Ferng Chiu
