A Distributed Recovery Block Approach to Fault-Tolerant Execution of Application Tasks in Hypercubes

An approach to fault-tolerant execution of real-time application tasks in hypercubes is proposed. The approach is based on the distributed recovery block (DRB) scheme and requires no special hardware mechanisms to support fault tolerance. Each task is assigned to a pair of processors that form a DRB computing station and execute the task in a dual-redundant, self-checking mode. An assignment of all tasks of an application in this form is called a full DRB mapping. The DRB scheme was developed as an approach to uniform treatment of hardware and software faults with the effect of fast forward recovery. However, if the system developer is concerned only with hardware faults, then forming DRB stations becomes a mechanical process that places no burden on the application software designer. A procedure for converting an efficient nonredundant task-to-processor mapping into an efficient full DRB mapping is presented.
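
The idea of turning a nonredundant mapping into a full DRB mapping can be illustrated with a minimal sketch. It is not the conversion procedure of the paper, which the abstract does not spell out. It assumes that the nonredundant mapping occupies a `dim`-dimensional subcube and that one additional dimension is available, so each task's partner processor is the neighbor across that extra dimension. The function names, the `primary`/`shadow` ordering of the pair, and the example task names are all hypothetical.

```python
from typing import Dict, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of hypercube links on a shortest path between nodes a and b."""
    return bin(a ^ b).count("1")


def to_full_drb_mapping(
    mapping: Dict[str, int], dim: int
) -> Dict[str, Tuple[int, int]]:
    """Lift a nonredundant mapping on a dim-cube to node pairs on a (dim+1)-cube.

    Assumption (not from the paper): the partner of node n is n with the
    extra top bit set, so the two members of each pair are direct neighbors.
    """
    for task, node in mapping.items():
        if not 0 <= node < (1 << dim):
            raise ValueError(f"task {task!r}: node {node} is outside the {dim}-cube")
    partner_bit = 1 << dim  # the added dimension
    return {task: (node, node | partner_bit) for task, node in mapping.items()}


# Hypothetical nonredundant mapping of three tasks onto a 2-cube.
nonredundant = {"t0": 0b00, "t1": 0b01, "t2": 0b11}
drb = to_full_drb_mapping(nonredundant, dim=2)
```

Under these assumptions, the two processors of every station are adjacent, and the distance between the first members of any two stations equals the distance between the corresponding nodes in the original mapping, so the communication locality of the nonredundant mapping carries over.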
