A survey and review of the current state of rollback‐recovery for cluster systems

Many research problems require considerable time and computational resources to solve. Attempting to solve them produces long‐running applications that need a reliable and trustworthy system on which to execute. Cluster systems provide an excellent environment for these applications because of their low cost‐to‐performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveyed and reviewed the issues in providing fault tolerance for long‐running applications. Several fault tolerance approaches were investigated, and rollback‐recovery was found to be a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback‐recovery: checkpointing and recovery. The survey showed that a great deal of work has gone into enhancing checkpointing, whereas the intricacies of providing recovery have been neglected. The problems associated with recovery include providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.
