Finding a suitable checkpoint and recovery protocol for a distributed application

Checkpoint and recovery protocols are commonly used to provide fault tolerance in distributed applications. The performance of such a protocol is judged by the amount of computation it saves against the overhead it incurs. This performance depends on system and application characteristics as well as on protocol-specific parameters. Hence no single checkpoint and recovery protocol works equally well for all applications, and for a given distributed application and the system it will run on, it is important to choose the protocol that gives the best performance. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique, also of independent interest, for measuring the similarity between the communication patterns of two distributed applications. The similarity measure is based on a graph similarity problem, for which we present a heuristic. Extensive experimental results for both the graph similarity heuristic and the automatic identification scheme show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.
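To make the idea of selecting a protocol by communication-pattern similarity concrete, the following is a minimal sketch and not the heuristic of this paper. It assumes that a communication pattern is recorded as a weighted directed graph mapping (sender, receiver) pairs to message counts, that process identifiers are already aligned between the two applications (a real graph similarity heuristic must also match vertices), and that a best-performing protocol is already known for a few reference applications. The names `similarity`, `pick_protocol`, and the protocol labels are hypothetical.

```python
from typing import Dict, Tuple

Edge = Tuple[int, int]            # (sender process, receiver process)
CommGraph = Dict[Edge, float]     # edge -> number of messages (assumed representation)


def normalize(graph: CommGraph) -> CommGraph:
    """Scale edge weights so they sum to 1; an empty graph stays empty."""
    total = sum(graph.values())
    return {e: w / total for e, w in graph.items()} if total else {}


def similarity(a: CommGraph, b: CommGraph) -> float:
    """Weighted edge overlap in [0, 1].

    Assumption: vertices of `a` and `b` are already aligned, so edges can be
    compared directly. This is a simplification, not the paper's measure.
    """
    na, nb = normalize(a), normalize(b)
    return sum(min(w, nb.get(e, 0.0)) for e, w in na.items())


def pick_protocol(target: CommGraph,
                  known: Dict[str, Tuple[CommGraph, str]]) -> str:
    """Return the protocol recorded for the most similar reference application."""
    _, protocol = max(known.values(),
                      key=lambda entry: similarity(target, entry[0]))
    return protocol


if __name__ == "__main__":
    # Hypothetical reference applications and protocol labels.
    ring = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}
    star = {(0, 1): 10, (0, 2): 10, (0, 3): 10}
    known = {"ring_app": (ring, "coordinated"),
             "star_app": (star, "message_logging")}
    target = {(0, 1): 5, (1, 2): 5, (2, 3): 5, (3, 0): 4}
    print(pick_protocol(target, known))
```

In this sketch the target application communicates in a ring, so it is matched to the ring-shaped reference application and inherits that application's protocol label. The vertex-alignment assumption is the part a genuine graph similarity heuristic would have to replace.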
