Primary-shadow consistency issues in the DRB scheme and the recovery time bound

The distributed recovery block (DRB) scheme is an approach for realizing both hardware and software fault tolerance in real-time distributed and parallel computer systems. We point out that for the DRB scheme to yield high fault coverage and a low recovery time bound, the replicated application tasks in a DRB computing station must satisfy some important consistency requirements. We present newly identified approaches for meeting these requirements, which involve, among other things, integrating network surveillance and reconfiguration (NSR) techniques with the DRB scheme. The paper then analyzes the recovery time bound of the DRB scheme. The analysis is based on a modularly structured, concrete implementation model of the DRB scheme for local area network (LAN)-based distributed computer systems, called the DRB/T LAN scheme, which incorporates an NSR scheme and the newly identified consistency-ensuring mechanisms. Finally, we consider approaches for applying the DRB scheme to new types of application computation segments that were not considered before, and we discuss approaches for meeting the consistency requirements in such DRB stations. These approaches significantly broaden the application range of the DRB scheme.
