Fault Control Using Triple Modular Redundancy (TMR)

Operating systems have expanded rapidly in both capabilities and resources, and the occurrence of faults in such systems is an unavoidable concern. A fault is a deviation from the system's specified behavior, and it can lead to one or more failures. To avoid such failures, the fault must be removed or controlled. The techniques commonly used to control and isolate faults are replication and checkpointing. This paper aims to control detected faults using the long-established technique of triple modular redundancy (TMR), a form of N-modular redundancy in which three replicas perform the same computation and a voter selects the majority result. Although TMR offers very high reliability, it has not been used to create a fault-tolerant system. We propose a system that uses TMR to mask and mitigate detected faults effectively, providing uninterrupted use of the entire operating system.
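As an illustration of the voting idea behind TMR, the following is a minimal Python sketch rather than the system proposed in the paper. The names `majority_vote` and `run_tmr` are hypothetical, and the sketch assumes that replica outputs are hashable values that can be compared for equality.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value that at least two of the three replicas agree on."""
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    # most_common(1) yields the (value, count) pair with the highest count.
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three replicas disagree, so the fault cannot be masked.
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def run_tmr(replicas: Sequence[Callable[[int], Hashable]], x: int) -> Hashable:
    """Run the same input through three replicas and vote on the result."""
    return majority_vote([replica(x) for replica in replicas])


# Hypothetical replicas: two healthy modules and one faulty module.
def healthy(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1  # simulated fault: result is off by one


masked = run_tmr([healthy, faulty, healthy], 4)
print(masked)
```

With a single faulty replica, the two healthy replicas outvote it and the fault is masked. If all three outputs differ, the voter in this sketch raises an error instead of guessing, since simple majority voting can mask only one faulty module at a time.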
