Self-Adjusting Indexing Techniques for Communication-Induced Checkpointing Protocols

Communication-induced checkpointing (CIC) protocols can be used to prevent the domino effect. Among such protocols, those in the index-based category associate checkpoints with sequence numbers such that checkpoints with equal sequence numbers are guaranteed to be consistent. To achieve this, an index-based protocol relies on an underlying indexing method. The adopted indexing scheme has a great impact on the number of forced checkpoints the protocol takes. Moreover, an indexing method exhibits different performance under different degrees of heterogeneity in a distributed system. However, all existing index-based protocols employ a fixed indexing scheme and thus cannot adapt well to all kinds of computing environments. In this paper, we propose two new indexing techniques that adjust themselves according to the extent of the present system heterogeneity. The new methods are also validated by a simulation study.
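
To make the idea of index-based checkpointing concrete, the following is a minimal sketch of the classical fixed indexing rule commonly attributed to Briatico, Ciuffoletti, and Simoncini. It is not either of the self-adjusting techniques proposed in this paper. The class `Process` and its method names are hypothetical and chosen only for illustration. Each process keeps a sequence number, piggybacks it on every outgoing message, and takes a forced checkpoint before delivering a message that carries a larger number.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process running a classical fixed index-based CIC rule."""
    name: str
    sn: int = 0  # sequence number of the latest checkpoint
    checkpoints: list = field(default_factory=list)  # (sn, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint of index 0.
        self.checkpoints.append((self.sn, "initial"))

    def basic_checkpoint(self):
        # A checkpoint taken independently by the process (e.g. on a timer).
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self):
        # The current sequence number is piggybacked on each message.
        return self.sn

    def receive(self, piggybacked_sn):
        # Forced checkpoint: taken before delivery when the sender is ahead,
        # so that checkpoints with equal indices stay consistent.
        if piggybacked_sn > self.sn:
            self.sn = piggybacked_sn
            self.checkpoints.append((self.sn, "forced"))
            return True
        return False


p, q = Process("P"), Process("Q")
p.basic_checkpoint()             # P advances to index 1
forced = q.receive(p.send())     # Q is behind, so it is forced to checkpoint
again = q.receive(p.send())      # Q has caught up, so no forced checkpoint
```

Under this fixed rule, a process that takes basic checkpoints more often than its peers pushes the index up quickly and induces forced checkpoints in the processes it communicates with, which is one way heterogeneity can affect the forced-checkpoint count.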
