Paper Information - Communication-Based Prevention of Non-P-Pattern

Communication-Based Prevention of Non-P-Pattern

An issue pertinent to the design of checkpointing protocols is how to improve the autonomy of checkpointing and keep computation loss under control. To address the problem, a time-based multi-cycle checkpointing protocol is proposed in this paper. In this protocol, processes are allowed to take checkpoints with desired checkpoint cycles. To enable recent checkpoints to be used to form a consistent global checkpoint, a communication-based checkpoint cycle adjustment approach is also proposed. In this approach, the checkpoint cycle adjustment of each process follows a P-pattern. Simulation results show that the rollback deviation of the proposed protocol can be well controlled under a low checkpointing overhead.

[1] Nuno Neves, et al. Coordinated checkpointing without direct coordination, 1998, Proceedings. IEEE International Computer Performance and Dependability Symposium. IPDS'98 (Cat. No.98TB100248).

[2] Richard D. Schlichting, et al. Fail-stop processors: an approach to designing fault-tolerant computing systems, 1983, TOCS.

[3] Willy Zwaenepoel, et al. The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.

[4] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.

[5] Lorenzo Alvisi, et al. Causality tracking in causal message-logging protocols, 2002, Distributed Computing.

[6] Achour Mostéfaoui, et al. Communication-based prevention of useless checkpoints in distributed computations, 2000, Distributed Computing.

[7] Jeffrey S. Vetter, et al. Communication characteristics of large-scale scientific applications for contemporary cluster architectures, 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.

[8] L. Alvisi, et al. A Survey of Rollback-Recovery Protocols, 2002.

[9] D. Manivannan, et al. Quasi-Synchronous Checkpointing: Models, Characterization, and Classification, 1999, IEEE Trans. Parallel Distributed Syst.

[10] Chita R. Das, et al. Towards a communication characterization methodology for parallel applications, 1997, Proceedings Third International Symposium on High-Performance Computer Architecture.

[11] Lorenzo Alvisi, et al. An analysis of communication induced checkpointing, 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[12] Jichiang Tsai. On Properties of RDT Communication-Induced Checkpointing Protocols, 2003, IEEE Trans. Parallel Distributed Syst.

[13] W. Kent Fuchs, et al. Checkpoint Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems, 1995, IEEE Trans. Parallel Distributed Syst.

[14] Brian Randell. System structure for software fault tolerance, 1975.

[15] Yixin Yang, et al. A Novel Roll-Back Mechanism for Performance Enhancement of Asynchronous Checkpointing and Recovery, 2007, Informatica.

[16] D. Manivannan, et al. A low-overhead recovery technique using quasi-synchronous checkpointing, 1996, Proceedings of 16th International Conference on Distributed Computing Systems.

[17] Roberto Baldoni, et al. An Index-Based Checkpointing Algorithm for Autonomous Distributed Systems, 1999, IEEE Trans. Parallel Distributed Syst.

[18] David J. Lilja, et al. Exploiting multiple heterogeneous networks to reduce communication costs in parallel programs, 1997, Proceedings Sixth Heterogeneous Computing Workshop (HCW'97).

[19] Richard Koo, et al. Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.

[20] David J. Lilja, et al. Characterization of Communication Patterns in Message-Passing Parallel Scientific Application Programs, 1998, CANPC.

[21] Islene C. Garcia, et al. Non-Blocking Synchronous Checkpointing Based on Rollback-Dependency Trackability, 2006, 2006 25th IEEE Symposium on Reliable Distributed Systems (SRDS'06).