Restart services for highly available systems

This paper proposes a design methodology for building highly available systems and describes a set of operating system services that support it. The techniques target a parallel environment and can be generalized to any distributed system. We present the methodology for providing basic high-availability services, the specific services for restart, and an implementation of these services.
