Self-Adjusting Indexing Techniques for Communication-Induced Checkpointing Protocols

Communication-induced checkpointing (CIC) protocols can be used to prevent the domino effect. Among such protocols, those in the index-based category associate checkpoints with sequence numbers such that checkpoints with the same sequence number are guaranteed to be consistent. To achieve this, an index-based protocol relies on an underlying indexing method, and the indexing scheme adopted strongly affects the number of forced checkpoints the protocol takes. Moreover, a given indexing method performs differently under different degrees of heterogeneity in a distributed system. All existing index-based protocols, however, employ a single fixed indexing scheme and thus cannot adapt well to all kinds of computing environments. In this paper, we propose two new indexing techniques that adjust themselves according to the degree of heterogeneity currently present in the system. A simulation study supports both methods.
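To make the role of the sequence number concrete, the following is a minimal sketch of the classic fixed indexing rule commonly used as a baseline in the literature (often referred to as BCS). It is not one of the two self-adjusting techniques proposed in the paper, whose details the abstract does not give. The class and method names (`Process`, `basic_checkpoint`, `send`, `receive`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Illustrative process running a fixed index-based CIC rule (assumed BCS-style)."""
    pid: int
    sn: int = 0  # current checkpoint sequence number (index)
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint of index 0.
        self.checkpoints.append((self.sn, "initial"))

    def basic_checkpoint(self):
        # A basic (locally scheduled) checkpoint advances the index by one.
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        # The current index is piggybacked on every outgoing message.
        return self.sn

    def receive(self, msg_sn):
        # If the sender's index is ahead, take a forced checkpoint with that
        # index before delivering the message; otherwise deliver directly.
        if msg_sn > self.sn:
            self.sn = msg_sn
            self.checkpoints.append((self.sn, "forced"))
            return True
        return False


# Usage: p checkpoints more often than q, so its messages force q to catch up.
p, q = Process(0), Process(1)
p.basic_checkpoint()
forced = q.receive(p.send())
```

In this sketch, a process that takes basic checkpoints more frequently than its peers keeps pushing their indices forward and so induces forced checkpoints on them, which illustrates why a single fixed rule can behave poorly when checkpointing rates are heterogeneous.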
