Hybrid checkpointing protocol based on selective-sender-based message logging

This paper presents a hybrid checkpointing protocol that combines two techniques: an asynchronous checkpointing protocol, which uses changes in message sending/receiving state to reduce the overhead of failure-free operation, and a selective sender-based message logging protocol, which reduces the cascading rollback of the asynchronous checkpointing protocol. The selective sender-based message logging protocol records only potential orphan messages when a checkpoint is taken. The paper also presents a message dependency tree that records inter-process message sending/receiving information in volatile storage, reducing the time needed to search for inter-process information during failure recovery.
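The cascading rollback mentioned above can be illustrated with a small sketch. This is not the protocol or the message dependency tree from the paper; it is the textbook rollback-propagation rule for independently taken checkpoints, written under assumptions made only for this example: the function name `recovery_line`, the tuple layout of `messages`, and the convention that interval k is the period after a process's k-th checkpoint are all hypothetical.

```python
def recovery_line(current, messages, failed):
    """Compute which checkpoint each process restarts from after a failure.

    current:  dict mapping process -> index of its latest checkpoint
    messages: iterable of (sender, send_interval, receiver, recv_interval),
              where interval k is the period after the k-th checkpoint
    failed:   the process that crashed and restarts from its latest checkpoint

    Returns a dict mapping process -> checkpoint index to restart from,
    or None if the process does not need to roll back.
    (Illustrative model only; names and layout are assumptions.)
    """
    line = {p: None for p in current}
    line[failed] = current[failed]

    changed = True
    while changed:
        changed = False
        for sender, s, receiver, r in messages:
            # The send is undone if the sender restarts at or before interval s.
            undone = line[sender] is not None and s >= line[sender]
            # The receive is kept if the receiver's state still includes interval r.
            kept = line[receiver] is None or r < line[receiver]
            if undone and kept:
                # Orphan message: the receiver must roll back past the receive.
                line[receiver] = r
                changed = True
    return line


if __name__ == "__main__":
    current = {"A": 1, "B": 1, "C": 1}
    messages = [
        ("A", 1, "B", 0),  # B received this before taking its checkpoint 1
        ("B", 0, "C", 1),  # C received this after taking its checkpoint 1
    ]
    print(recovery_line(current, messages, "A"))
    # {'A': 1, 'B': 0, 'C': 1}
```

In this example a single failure of A forces B back to its initial state and C back to its latest checkpoint. Logging the messages that could become orphans is one way to limit this kind of propagation, which is the motivation the abstract gives for the selective logging component.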
