Time-bounded cooperative recovery with the distributed real-time conversation scheme

Of the several schemes proposed to handle the propagation of erroneous information among interacting processes in distributed and parallel computer systems, the distributed real-time conversation (DRC) scheme stands out for its fast forward-recovery capability, which is essential in safety-critical hard-real-time applications. The core approach of the DRC scheme is to make a group of computing stations cooperate in recovering from hardware and software faults that may occur during their interaction. However, previous formulations of the scheme remained at a relatively abstract level, and practical models for implementing it in complex safety-critical real-time applications had not been established. In this paper, we present a practical implementation model for the DRC scheme. A simple model of an anti-missile defense system is used to illustrate the main structuring principles of the DRC scheme and the major components of the practical implementation model.
