A New Diskless Checkpointing Approach for Multiple Processor Failures

Diskless checkpointing is an important technique for providing fault tolerance in distributed and parallel computing systems. This study proposes a new approach that enhances neighbor-based diskless checkpointing to tolerate multiple failures using simple checkpointing and failure recovery operations, without relying on dedicated checkpoint processors. In this scheme, each processor saves its checkpoints in a set of peer processors, called checkpoint storage nodes. In turn, each processor uses simple XOR operations to combine and store the checkpoints of the processors for which it serves as a checkpoint storage node. This study defines the concept of a safe recovery criterion, which specifies the requirement for ensuring that any failed processor can be recovered in a single step using the checkpoint data stored at one of the surviving processors, as long as no more than a given number of failures occur. This study further identifies the necessary and sufficient conditions for satisfying the safe recovery criterion and presents a method for designing checkpoint storage node sets that meet these conditions. The proposed scheme allows failure recovery to be performed in a distributed manner using XOR operations.
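
The mechanics described above can be illustrated with a minimal Python sketch. It is not the construction presented in the study: the function names, the ring-offset layout of checkpoint storage node sets, and the brute-force check are all assumptions made for illustration. Each storage node keeps the XOR of the checkpoints of its client processors, and a failed processor is recovered in a single step from one surviving storage node whose other clients have all survived. The sketch assumes equal-length checkpoints and that each surviving processor still holds its own checkpoint in memory.

```python
from itertools import combinations

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def build_storage_sets(n, offsets):
    """Hypothetical layout: processor i stores checkpoints at (i + d) mod n."""
    return {i: {(i + d) % n for d in offsets} for i in range(n)}

def clients_of(storage_sets):
    """Invert the mapping: for each node, which processors it stores for."""
    clients = {j: set() for j in storage_sets}
    for i, nodes in storage_sets.items():
        for j in nodes:
            clients[j].add(i)
    return clients

def take_checkpoint(storage_sets, checkpoints):
    """Each node keeps one XOR of all of its clients' checkpoints."""
    size = len(next(iter(checkpoints.values())))
    parity = {}
    for j, cs in clients_of(storage_sets).items():
        acc = bytes(size)
        for i in cs:
            acc = xor_bytes(acc, checkpoints[i])
        parity[j] = acc
    return parity

def recover(proc, failed, storage_sets, checkpoints, parity):
    """Single-step recovery of `proc` from one surviving storage node."""
    clients = clients_of(storage_sets)
    for j in sorted(storage_sets[proc]):
        others = clients[j] - {proc}
        if j in failed or others & failed:
            continue  # node is down, or its XOR mixes in another lost checkpoint
        acc = parity[j]
        for i in others:  # cancel the surviving clients' contributions
            acc = xor_bytes(acc, checkpoints[i])
        return acc
    return None

def is_safe(storage_sets, t):
    """Brute-force check that every failure set of size <= t is recoverable."""
    clients = clients_of(storage_sets)
    for k in range(1, t + 1):
        for failed in map(set, combinations(range(len(storage_sets)), k)):
            for f in failed:
                if not any(j not in failed and not ((clients[j] - {f}) & failed)
                           for j in storage_sets[f]):
                    return False
    return True

# Example: 7 processors, each storing at the peers 1 and 3 positions ahead.
sets_13 = build_storage_sets(7, (1, 3))
ckpt = {i: bytes([(i * 17) % 256] * 8) for i in range(7)}
parity = take_checkpoint(sets_13, ckpt)
failed = {2, 3}
restored = {f: recover(f, failed, sets_13, ckpt, parity) for f in failed}
```

In this toy layout, adjacent offsets such as (1, 2) fail the brute-force check for two failures, while offsets (1, 3) on seven processors pass it, which shows why the choice of checkpoint storage node sets matters.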
