Fault Tolerant Microprocessor System Design - a Case Study

In this paper we report on the design of an ultra-reliable microprocessor system based on the dual-symmetric computer concept. The system is designed to detect and report faults and to remain in operation while under repair. It has been designed and built for the Melbourne Metropolitan Fire Board as a high-reliability remote station in a distributed fire alarm monitoring system called METCOM. The core of each remote station is a network of three microprocessors. Two of these processors have identical software and run in task synchronization, continually comparing task answers. The third processor runs a different set of tasks, but whenever a fault is suspected the three processors go into a 'huddle' and a two-out-of-three majority voting scheme is used to determine the faulty module. All peripheral cards are duplicated, and faults in these cards are automatically isolated and reported. This paper discusses the problems associated with the design and implementation of built-in fault-tolerance and fault-recovery techniques in real-time multiprocessor microprocessor systems.
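
The two-out-of-three vote described above can be illustrated with a minimal sketch. It is not the METCOM implementation: the function name `find_faulty_module` is hypothetical, and the sketch assumes that during a 'huddle' all three processors compute an answer to the same check task, so that their answers can be compared directly.

```python
from typing import Optional, Sequence


def find_faulty_module(answers: Sequence[int]) -> Optional[int]:
    """Two-out-of-three majority vote over three module answers.

    Returns the index (0, 1 or 2) of the module whose answer disagrees
    with the other two, or None if all three agree. Raises ValueError
    if no two answers agree, since no majority exists in that case.

    Assumption (not from the paper): each answer is an integer result
    of the same check task run on all three processors.
    """
    if len(answers) != 3:
        raise ValueError("expected exactly three answers")
    a, b, c = answers
    if a == b == c:
        return None   # no fault detected
    if a == b:
        return 2      # third module is outvoted
    if a == c:
        return 1      # second module is outvoted
    if b == c:
        return 0      # first module is outvoted
    raise ValueError("no majority: all three answers differ")
```

For example, `find_faulty_module([7, 7, 9])` identifies module 2 as the odd one out. The case where all three answers differ is reported as an error here because a simple majority vote cannot isolate a single faulty module in that situation.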