The performance and reliability achieved by a modular redundant system depend on the recovery scheme used. Typically, a gain in performance obtained with comparable resources comes at the cost of reduced reliability; several high-performance computers are noted for their small mean time to failure. Performance is measured here in terms of the mean and variance of the task completion time. Reliability is a task-based measure, defined as the probability that a task is completed correctly. Two roll-forward schemes are compared with two rollback schemes for achieving recovery in duplex systems. The roll-forward schemes discussed here are based on a roll-forward checkpointing concept. They achieve significantly better performance than the rollback schemes by avoiding rollback in the most common fault scenarios. It is shown that the roll-forward schemes improve performance with only a small loss in reliability compared to the rollback schemes.