Paper information - Fault tolerance under UNIX

The initial design for a distributed, fault-tolerant version of UNIX based on three-way atomic message transmission was presented in an earlier paper [3]. The implementation effort then moved from Auragen Systems to Nixdorf Computer, where it was completed. This paper describes the working system, now known as the TARGON/32. The original design left open questions in at least two areas: fault tolerance for server processes and recovery after a crash were briefly and inaccurately sketched, and rebackup after recovery was not discussed at all. The fundamental design involving three-way message transmission has remained unchanged. However, in addition to important changes in the implementation, server backup has been redesigned and is now more consistent with that of normal user processes. Recovery and rebackup have been completed in a less centralized and thus more efficient manner than previously envisioned. In this paper we review important aspects of the original design and note how the implementation differs from our original ideas. We then focus on backup and recovery for server processes and on the changes and additions in the design and implementation of recovery and rebackup.

[1] Joost Verhofstad, et al. Recovery Techniques for Database Systems, 1978, CSUR.

[2] Irving L. Traiger, et al. The Recovery Manager of the System R Database Manager, 1981, CSUR.

[3] George G. Robertson, et al. Accent: A communication oriented network operating system kernel, 1981, SOSP.

[4] Joel F. Bartlett, et al. A NonStop kernel, 1981, SOSP.

[5] Bernd Walter, et al. A Robust and Efficient Protocol for Checking the Availability of Remote Sites, 1982, Comput. Networks.

[6] David L. Presotto, et al. Publishing: a reliable broadcast communication mechanism, 1983, SOSP '83.

[7] Barton P. Miller, et al. Process migration in DEMOS/MP, 1983, SOSP '83.

[8] Barbara Liskov, et al. Guardians and Actions: Linguistic Support for Robust, Distributed Programs, 1983, TOPL.

[9] Anita Borg, et al. A message system supporting fault tolerance, 1983, SOSP '83.

[10] David L. Presotto, et al. A Reliable Broadcast Communication Mechanism, 1983.

[11] Fred B. Schneider, et al. Byzantine generals in action: implementing fail-stop processors, 1984, TOCS.

[12] Won Kim. Highly available systems for database applications, 1984, CSUR.

[13] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.

[14] Barbara Liskov, et al. Highly available distributed services and fault-tolerant distributed garbage collection, 1986, PODC '86.