Fault tolerance in digital controllers using software techniques

Microprocessor based systems for controlling gas supplies require very high levels of reliability for safety reasons. Non-redundant systems are considered to be inadequate, and an alternative approach is necessary. in digital systems, transient faults are as much as fifty times more common than permanent faults. Therefore mechanisms which allow for recovery from transients will provide large Improvements in reliability. However, to enable effective design of recovery mechanisms it Is necessary to understand failure modes. The results from practical interference tests, designed to simulate transient faults, are presented. They show that corruption to the correct flow of program execution is a common failure, and that subsequent instruction fetches can be performed from any of the memory locations. Under these conditions any value of operation code can be Interpreted as an instruction, including those undeclared by the manufacturers. Four commonly used microprocessors are investigated to establish the functions of the undeclared codes, and other undeclared operations are revealed. Analyses on the sequence of events following a random jump into the four main memory types of data, program, unused and input areas, are presented. Recovery from this type of execution can be achieved by the addition of restart codes into the areas, so that execution can transfer to a recovery routine. The effect of this mechanism on the recovery process is investigated. Finally, some methods of testing systems, to check the levels of reliability improvement obtained by these techniques, are considered.