Low-cost schemes for fault tolerance
Two aspects of fault tolerance are fault diagnosis and fault recovery. This dissertation studies both and presents low-cost schemes for achieving each. Two models for fault tolerance are studied: modular redundancy and system-level diagnosis.
Modular redundant systems achieve fault detection and recovery by employing multiple replicas of each module. Such systems mask failures whenever possible. When high reliability must be achieved with low redundancy, however, failures cannot always be masked without retrying the computation. Checkpointing with rollback recovery is a technique that reduces the cost of retrying. This dissertation proposes multiprocessor fault tolerance schemes that reduce this cost further by exploiting the redundancy inherent in modular redundant systems. The proposed schemes are shown to outperform rollback schemes in the presence of faults.
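For orientation, the rollback baseline that the proposed schemes are compared against can be sketched as a small simulation. This is only an illustrative model, not the dissertation's scheme: it assumes a duplex system (two replicas) whose states are compared at every checkpoint, a fixed per-step fault probability, and that a faulty replica always produces a detectable mismatch. The function name and all parameters are hypothetical.

```python
import random

def run_duplex_with_rollback(steps, checkpoint_interval, fault_prob, rng):
    """Return the total number of steps executed, including retried ones.

    Assumed model (not from the source): two replicas run the same interval,
    their states are compared at the checkpoint, and any mismatch forces a
    rollback to the previous checkpoint.
    """
    committed = 0  # steps safely covered by the last good checkpoint
    executed = 0   # steps actually run, counting repeated intervals
    while committed < steps:
        interval = min(checkpoint_interval, steps - committed)
        # Each replica is hit independently with probability fault_prob per step.
        replica_faulty = [
            any(rng.random() < fault_prob for _ in range(interval))
            for _ in range(2)
        ]
        executed += interval
        if not any(replica_faulty):
            committed += interval  # states match: take a new checkpoint
        # Otherwise the comparison fails and the interval is retried.
    return executed

if __name__ == "__main__":
    rng = random.Random(1)
    print(run_duplex_with_rollback(100, 10, 0.05, rng))
```

In this model every detected mismatch discards a whole interval of work on both replicas, which is the retry expense the abstract refers to.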
Any fault-tolerant system exhibits a trade-off between cost and performance. In modular redundant systems, this trade-off can be exploited to achieve high reliability at low cost by sacrificing some performance. The cost-performance trade-off is governed by the reliability-safety trade-off of modular redundant systems. This trade-off is studied, and the effect of increasing the level of redundancy on the reliability and safety of a modular redundant system is analyzed.
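A reliability-safety trade-off of this kind can be illustrated numerically under a simple textbook model, which is an assumption here and not necessarily the model used in the dissertation. Suppose n replicas each produce a correct result independently with probability r, a voter delivers a result only when at least k replicas agree (with k > n/2), and faulty replicas may all agree on the same wrong value. Reliability is then taken as the probability that a correct result is delivered, and safety as the probability that no incorrect result is delivered. The function names are hypothetical.

```python
from math import comb

def binom_tail(n, p, k):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def reliability_and_safety(n, k, r):
    """Reliability and safety of an n-replica system with agreement threshold k.

    Assumed worst case: all faulty replicas agree on one wrong value, so a
    wrong result is delivered whenever at least k replicas are faulty.
    """
    if not (n / 2 < k <= n):
        raise ValueError("threshold k must satisfy n/2 < k <= n")
    reliability = binom_tail(n, r, k)         # at least k correct replicas
    unsafe = binom_tail(n, 1 - r, k)          # at least k faulty replicas
    return reliability, 1 - unsafe

if __name__ == "__main__":
    for k in (2, 3):
        rel, safe = reliability_and_safety(3, k, 0.9)
        print(f"n=3 k={k}: reliability={rel:.3f} safety={safe:.3f}")
```

With three replicas and r = 0.9, raising the threshold from k = 2 to k = 3 lowers reliability from 0.972 to 0.729 while raising safety from 0.972 to 0.999 under these assumptions.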
System-level diagnosis is a graph-theoretic approach to diagnosing the status of the modules in a system. A method for minimizing the cost of diagnosis, named safe diagnosis, is proposed. It is shown that a high level of diagnostic safety can be achieved, in addition to the existing diagnostic reliability, with low overhead. It is also shown that achieving high safety does not increase the complexity of fault diagnosis algorithms.
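The graph-theoretic setting can be made concrete with a brute-force sketch under the classical PMC model, assumed here because the abstract does not name its test model. Modules test one another along directed edges; a fault-free tester reports the true status of the module it tests, while a faulty tester may report anything. The sketch enumerates every fault set of size at most t that is consistent with an observed syndrome. It does not implement safe diagnosis, which the abstract does not define, and the function name and example system are hypothetical.

```python
from itertools import combinations

def consistent_fault_sets(nodes, syndrome, t):
    """Return all fault sets of size <= t consistent with the syndrome.

    syndrome maps (tester, tested) -> 0 (test passed) or 1 (test failed).
    Under the assumed PMC model only fault-free testers are trusted.
    """
    nodes = list(nodes)
    result = []
    for size in range(t + 1):
        for candidate in combinations(nodes, size):
            faulty = set(candidate)
            if all(
                outcome == (1 if tested in faulty else 0)
                for (tester, tested), outcome in syndrome.items()
                if tester not in faulty
            ):
                result.append(faulty)
    return result

# Five modules in a ring; module i tests modules i+1 and i+2 (mod 5).
# Module 2 is faulty, so its own reports (on 3 and 4) are arbitrary.
syndrome = {
    (0, 1): 0, (0, 2): 1,
    (1, 2): 1, (1, 3): 0,
    (2, 3): 1, (2, 4): 0,
    (3, 4): 0, (3, 0): 0,
    (4, 0): 0, (4, 1): 0,
}

if __name__ == "__main__":
    print(consistent_fault_sets(range(5), syndrome, 2))  # [{2}]
```

The exhaustive search is exponential in t and serves only to show what consistency with a syndrome means; it is not a practical diagnosis algorithm.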