Dependability under malicious agreement in N-modular redundancy-on-demand systems

In a multiprocessor under normal loading conditions, idle processors offer natural spare capacity. Previous work has attempted to exploit this redundancy to overcome the limitations of classic diagnosability and modular redundancy techniques while still providing significant fault tolerance. A common approach is task duplexing. Unfortunately, the usefulness of this approach for critical applications is seriously undermined by its susceptibility to agreement on faulty outcomes (malicious agreement). To assess the dependability of duplexing under malicious agreement, we propose a stochastic model that dynamically profiles system behavior in the presence of malicious faults. The model is based on a policy called NMR on demand (NMROD): each task in the multiprocessor is duplicated, and additional processors are allocated for recovery as needed. NMROD relies on a fault model that favors response correctness over actual fault status, and it integrates online repair to provide non-stop operation over an extended period.
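The duplicate-then-escalate policy described above can be illustrated with a minimal Python sketch. It is not the paper's model: the names `nmr_on_demand`, `good`, and `bad` are hypothetical, processors are modeled as plain functions, and the acceptance rule (accept once a strict majority of the replicas run so far agree) is an assumption chosen for illustration.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model: a processor is a function from a task input to a result.
Processor = Callable[[int], int]


def nmr_on_demand(
    task_input: int, processors: Sequence[Processor]
) -> Tuple[Optional[int], int]:
    """Run a task in duplex and add replicas only when results disagree.

    Returns (accepted result, processors used); the result is None when
    the available processors run out before any value reaches a majority.
    The strict-majority acceptance rule is an assumption of this sketch.
    """
    if len(processors) < 2:
        raise ValueError("duplexing needs at least two processors")

    # Start with a duplex: the task runs on two processors.
    results = [processors[0](task_input), processors[1](task_input)]
    used = 2
    while True:
        value, votes = Counter(results).most_common(1)[0]
        if votes * 2 > used:  # strict majority of replicas run so far
            return value, used
        if used == len(processors):  # no spare capacity left
            return None, used
        # Disagreement: allocate one more processor on demand.
        results.append(processors[used](task_input))
        used += 1


def good(x: int) -> int:
    return x * x  # fault-free response


def bad(x: int) -> int:
    return x * x + 1  # faulty response; two such replicas agree with each other


print(nmr_on_demand(3, [good, good, good]))  # (9, 2): duplex agrees, no extra work
print(nmr_on_demand(3, [good, bad, good]))   # (9, 3): third replica breaks the tie
print(nmr_on_demand(3, [bad, bad, good]))    # (10, 2): malicious agreement
```

The last call shows the weakness the abstract describes: when both duplex replicas produce the same faulty result, the comparison passes, the wrong value is accepted, and no additional processor is ever consulted.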
