A Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems

Fault tolerance techniques enable systems to perform their tasks in the presence of faults. A checkpoint is a local state of a process saved on stable storage. In a distributed system, the processes do not share memory, so a global state of the system is defined as a set of local states, one from each process. When a fault occurs in a distributed system, checkpointing allows the execution of a program to be resumed from a previous consistent global state rather than from the beginning. In this way, the amount of useful processing lost because of the fault is significantly reduced. Coordinated checkpointing is an effective fault tolerance technique in distributed systems because it avoids the domino effect and requires minimal storage. Most earlier coordinated checkpointing algorithms suffer from at least one of the following drawbacks: they block the computation during checkpointing; they are non-blocking but force all processes to take checkpoints, even though many of those checkpoints may not be necessary; they are non-blocking and minimum-process but take useless checkpoints; or they reduce the number of useless checkpoints at the cost of higher synchronization message overhead or longer checkpoint request propagation time. In this paper, we discuss various issues related to checkpointing in distributed systems and mobile computing environments. We also present a survey of some checkpointing algorithms for distributed systems.
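The notion of a consistent global state used above can be illustrated with a short sketch. It assumes the usual definition from the checkpointing literature, which the abstract does not spell out: a set of local checkpoints is consistent if no message is recorded as received without also being recorded as sent (no orphan messages). The names `Message`, `orphan_messages`, and `is_consistent`, and the representation of a checkpoint as a count of local events, are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_event: int  # 1-based index of the send event in the sender's history
    recv_event: int  # 1-based index of the receive event in the receiver's history


def orphan_messages(checkpoints, messages):
    """Return messages whose receipt is saved but whose sending is not.

    `checkpoints` maps each process name to the number of local events
    covered by its saved checkpoint (hypothetical representation).
    """
    return [
        m for m in messages
        if m.recv_event <= checkpoints[m.receiver]
        and m.send_event > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A global checkpoint is consistent if it contains no orphan message."""
    return not orphan_messages(checkpoints, messages)


# P1 sends a message at its 3rd event; P2 receives it at its 2nd event.
messages = [Message("P1", "P2", send_event=3, recv_event=2)]

print(is_consistent({"P1": 3, "P2": 2}, messages))  # send and receive both saved
print(is_consistent({"P1": 2, "P2": 2}, messages))  # receive saved, send not saved
```

In the second call the message is an orphan, so resuming from that pair of checkpoints would leave P2 in a state that depends on a message P1 has not yet sent. A message that is sent but not yet received (in transit) does not make the global state inconsistent under this definition.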
