A supervisor-based semi-centralized network surveillance scheme and the fault detection latency bound

Network surveillance (NS) schemes enable each interested fault-free node in a system to learn quickly of faults or repair-completion events occurring in other parts of the system. Concrete real-time NS schemes that are effective in distributed computer systems based on point-to-point network architectures are currently scarce. We present a semi-centralized real-time NS scheme, called the supervisor-based NS (SNS) scheme, that is effective in a variety of point-to-point networks. The scheme is highly scalable and can be implemented entirely in software using commercial off-the-shelf (COTS) components, without any special-purpose hardware support. Efficient execution support for the scheme has been designed as a new extension of the DREAM kernel, a timeliness-guaranteed operating system kernel model developed in the authors' laboratory. This design can be viewed as an implementation model that can be easily adapted to various commercial operating system kernels. The paper also analyzes the SNS scheme on the basis of this implementation model to obtain some tight bounds on the fault detection latency.
