Time-Based Coordinated Checkpointing

Distributed systems support the execution of applications ranging from long-running scientific simulators to e-commerce on the Internet. In such environments, the failure of a single component, whether a computer or the network, may prevent other components from completing their tasks. Since the probability of failure increases with the number of computers and with execution time, these applications are likely to be interrupted unless provision is made for failure handling. This thesis addresses the problem of fault recovery in distributed systems. It describes two variations of a coordinated checkpoint protocol that uses time to remove most causes of overhead and to avoid all types of direct coordination. The time-based protocol does not have to transmit extra messages, does not need to tag application messages, and accesses stable storage only when checkpoints are saved. The thesis also describes a new coordinated checkpoint protocol that is well adapted to mobile environments. It uses time to indirectly coordinate the creation of new global states, and it saves two different types of checkpoints to adapt its behavior to the current network characteristics. Traditional techniques for fault diagnosis in distributed systems, based on either watchdogs or polling, trade performance against detection latency. The thesis introduces a complementary mechanism that uses the error codes returned by stream sockets. Since these errors are generated automatically when there is communication with a failed process, the mechanism incurs only a small overhead. Our results show that, in most cases, failures could be located using only the errors from the sockets.
A large number of checkpoint-based recovery protocols have been proposed in the literature; however, most of them were never evaluated. The thesis describes the design and implementation of a run-time system for clusters of workstations that allows the rapid testing of checkpoint protocols with standard benchmarks. RENEW (Recoverable Network of Workstations) provides a flexible set of operations that facilitates the integration of checkpoint and rollback recovery protocols.
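
The idea of using stream-socket error codes as failure hints can be illustrated with a minimal sketch. It is not the mechanism implemented in the thesis: the mapping from errno values to hints, the hint labels, and the helper names `classify_socket_error` and `send_with_hint` are all assumptions made for illustration.

```python
import errno
import socket

# Hypothetical mapping from errno values to coarse failure hints.
# The choice of codes and labels is an assumption, not taken from the thesis.
FAILURE_HINTS = {
    errno.ECONNREFUSED: "peer_process_down",   # host answered, nothing listens on the port
    errno.ECONNRESET: "peer_process_down",     # peer reset the connection
    errno.EPIPE: "peer_process_down",          # wrote to a connection the peer closed
    errno.ETIMEDOUT: "host_or_network_unreachable",
    errno.EHOSTUNREACH: "host_or_network_unreachable",
    errno.ENETUNREACH: "host_or_network_unreachable",
}


def classify_socket_error(exc):
    """Map an OSError raised by a stream socket to a coarse failure hint."""
    return FAILURE_HINTS.get(exc.errno, "unknown")


def send_with_hint(sock, payload):
    """Send payload on a connected stream socket.

    Returns None on success, or a failure hint if the socket layer
    reported an error while sending.
    """
    try:
        sock.sendall(payload)
        return None
    except OSError as exc:
        return classify_socket_error(exc)
```

Because the hint is derived from an error that the normal send already produced, this approach adds no extra messages. It can only report a failure when the application actually communicates with the failed peer, which is why the abstract calls the mechanism complementary to watchdogs and polling.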
