Communication-Based Prevention of Non-P-Pattern

One issue in the design of checkpointing protocols is how to improve checkpointing autonomy while keeping computation loss under control. To address this problem, this paper proposes a time-based multi-cycle checkpointing protocol in which each process may take checkpoints at its own desired checkpoint cycle. So that recent checkpoints can still be combined into a consistent global checkpoint, a communication-based checkpoint cycle adjustment approach is also proposed, under which the checkpoint cycle adjustment of each process follows a P-pattern. Simulation results show that the proposed protocol keeps rollback deviation well controlled at a low checkpointing overhead.
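To see why independent checkpoint cycles put computation loss at risk, consider a minimal sketch under strong simplifying assumptions that are not taken from the paper: time is measured in integer units, all clocks are perfectly synchronized, every process starts at time 0 and checkpoints at exact multiples of its own cycle, and a set of checkpoints taken at the same instant is treated as consistent. The function names are hypothetical, and the sketch does not model the cycle adjustment approach or the P-pattern.

```python
from functools import reduce
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


def latest_aligned_checkpoint(cycles: list[int], failure_time: int) -> int:
    """Latest time <= failure_time at which every process has a checkpoint.

    Assumption (illustrative only): process i checkpoints at 0, c_i, 2*c_i, ...
    so all processes checkpoint together only at multiples of lcm(cycles).
    """
    period = reduce(lcm, cycles)
    return (failure_time // period) * period


def computation_loss(cycles: list[int], failure_time: int) -> int:
    """Time units of work discarded when rolling back to the aligned checkpoint."""
    return failure_time - latest_aligned_checkpoint(cycles, failure_time)


if __name__ == "__main__":
    for cycles in ([2, 2, 2], [2, 3, 4], [5, 7]):
        print(cycles, computation_loss(cycles, failure_time=30))
```

In this toy model, identical cycles lose no work at time 30, whereas cycles of 5 and 7 never align before the failure and force a rollback to the start. That gap is the kind of loss an adjustment of checkpoint cycles is meant to bound.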
