A Survey of Rollback-Recovery Protocols

This survey covers rollback-recovery techniques that do not require special language constructs. In the first part of the survey we classify rollback-recovery protocols into checkpoint-based and log-based. Checkpoint-based protocols rely solely on checkpointing for system state restoration. Checkpointing can be coordinated, uncoordinated, or communication-induced. Log-based protocols combine checkpointing with logging of nondeterministic events, encoded in tuples called determinants. Depending on how determinants are logged, log-based protocols can be pessimistic, optimistic, or causal. Throughout the survey, we highlight the research issues that are at the core of rollback recovery and present the solutions that currently address them. We also compare the performance of different rollback-recovery protocols with respect to a series of desirable properties and discuss the issues that arise in the practical implementations of these protocols.
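To make the log-based idea concrete, the following minimal sketch logs one determinant per message delivery and replays the log after a simulated failure. It is an illustration under stated assumptions, not a protocol taken from the survey: the determinant layout (sender, sender sequence number, receiver, receive sequence number), the toy `Process` class, the in-memory list standing in for stable storage, and the choice to store the payload next to the determinant are all hypothetical. Writing the determinant before delivery mimics the pessimistic style.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Determinant:
    """Hypothetical determinant layout for one message-delivery event."""
    sender: str
    send_seq: int   # sequence number assigned by the sender
    receiver: str
    recv_seq: int   # position in the receiver's delivery order


class Process:
    """Toy process whose state depends only on the order of deliveries."""

    def __init__(self, name: str, state: int = 0, recv_seq: int = 0) -> None:
        self.name = name
        self.state = state
        self.recv_seq = recv_seq

    def checkpoint(self) -> Tuple[int, int]:
        # A checkpoint here is just a copy of the process state.
        return (self.state, self.recv_seq)

    def deliver(self, payload: int) -> None:
        # Deterministic update: the same payloads in the same order
        # always produce the same state.
        self.state = self.state * 31 + payload
        self.recv_seq += 1


# Stand-in for stable storage (assumption): (determinant, payload) pairs.
Log = List[Tuple[Determinant, int]]


def receive(proc: Process, log: Log, sender: str, send_seq: int, payload: int) -> None:
    """Pessimistic style: record the determinant before delivering."""
    det = Determinant(sender, send_seq, proc.name, proc.recv_seq)
    log.append((det, payload))
    proc.deliver(payload)


def recover(name: str, ckpt: Tuple[int, int], log: Log) -> Process:
    """Restore the checkpoint, then replay logged deliveries in order."""
    state, recv_seq = ckpt
    proc = Process(name, state, recv_seq)
    for det, payload in sorted(log, key=lambda e: e[0].recv_seq):
        # Skip deliveries already reflected in the checkpoint.
        if det.receiver == name and det.recv_seq >= recv_seq:
            proc.deliver(payload)
    return proc


def demo() -> Tuple[int, int]:
    p = Process("p")
    ckpt = p.checkpoint()
    log: Log = []
    receive(p, log, "a", 0, 5)
    receive(p, log, "b", 0, 7)
    receive(p, log, "a", 1, 2)
    recovered = recover("p", ckpt, log)
    return p.state, recovered.state
```

Calling `demo()` returns the pre-failure state and the recovered state, which should match because replay follows the logged delivery order. An optimistic or causal variant would differ in when and where the determinant is made stable, not in the replay step.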
