Application transparent fault management in fault tolerant Mach

A general-purpose operating system fault management mechanism, the sentry, has been defined and implemented for the Mach 3.0 microkernel running a UNIX 4.3 BSD server. The value of a mechanism in the operating system domain is usually judged by two criteria: its suitability for supporting a wide range of policies and its performance cost. Similarly, in fault detection and recovery, a relatively large number of strategies can be mapped onto mechanisms and policies for fault tolerance. To highlight the properties of the sentry mechanism for fault management, its suitability and performance are evaluated for sample fault detection policies and sample fault recovery policies. In the fault detection domain, use of the mechanism to support an assertion-type policy is presented and evaluated through an example. Two recovery policies have been chosen and evaluated: checkpoint/restart and checkpoint/restart/journaling.
