Log-based rollback recovery builds on the ideas of checkpoint-based rollback recovery and improves the characteristics of the recovery process. The basic idea captured by log-based rollback recovery techniques is an extension of the checkpoint idea. Instead of relying solely on checkpoints to recover from an error, the system also logs information about the non-deterministic events (e.g. the reception of a message) that happen between successive checkpoints. After an error occurs, the system uses checkpoints to recover a recent error-free state and replays the logged events to move its execution to a point as close as possible to the occurrence of the error. This paper presents four design patterns that capture the most widely used methods for log-based rollback recovery. First, the Logger pattern captures the general log-based rollback recovery idea. Then, the other three patterns (Optimistic Logging, Pessimistic Logging, and Causal Logging) describe specific solutions for when, where and how to keep logs of the non-deterministic events that have driven the execution of the system. The Optimistic Logging pattern describes the method that keeps the logs of communication events in the volatile memory of system constituents, without blocking their execution until the logs are safely moved to stable storage. The Pessimistic Logging pattern describes the opposite logging method: the log of every communication event must first be written to stable storage before the system constituent is able to continue its execution. Finally, the Causal Logging pattern captures a hybrid of the other two methods that attempts to combine their benefits in terms of the costs incurred by system executions with and without errors.
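The recovery idea described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the class `LoggedProcess`, its methods, and the use of in-memory lists to stand in for stable storage are all hypothetical simplifications, and the process state is reduced to a running sum of received messages. The `pessimistic` flag only contrasts the two basic logging disciplines (flush before continuing versus flush later); Causal Logging is not modelled.

```python
class LoggedProcess:
    """Toy process whose state is the sum of the messages it has received.

    Hypothetical illustration: `stable_log` and `checkpoint` stand in for
    stable storage; `state` and `volatile_log` are lost on a crash.
    """

    def __init__(self, pessimistic=True):
        self.pessimistic = pessimistic
        self.state = 0            # volatile application state
        self.volatile_log = []    # logged events not yet on stable storage
        self.stable_log = []      # logged events that survive a crash
        self.checkpoint = 0       # last saved error-free state

    def receive(self, message):
        # Message reception is the non-deterministic event being logged.
        self.volatile_log.append(message)
        if self.pessimistic:
            # Pessimistic: the event reaches stable storage before
            # the process is allowed to continue.
            self.flush()
        self.state += message

    def flush(self):
        # Move volatile log entries to (simulated) stable storage.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

    def take_checkpoint(self):
        # The checkpoint already reflects all earlier events,
        # so their log entries are no longer needed.
        self.checkpoint = self.state
        self.volatile_log.clear()
        self.stable_log.clear()

    def crash_and_recover(self):
        # A crash wipes volatile memory; recovery restores the checkpoint
        # and replays the events that reached stable storage.
        self.volatile_log = []
        self.state = self.checkpoint
        for message in self.stable_log:
            self.state += message
        return self.state
```

In this sketch a pessimistic process recovers exactly the state it had before the crash, at the price of a flush on every reception. An optimistic process (`pessimistic=False`) calls `flush()` only occasionally, so it runs without blocking but may lose the events that were still in volatile memory when the crash occurred.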
Note to the reader: The first two sections (Introduction and Background) contain a quick overview of fault tolerance terminology and system recovery concepts for readers who might not be very familiar with them. Readers who are comfortable with this terminology and these concepts may skip these two sections and focus on the three patterns that follow, which are the material that the author would like to be reviewed in the writers' workshop.