Reducing critical failures for control algorithms using executable assertions and best effort recovery

Systems that use f+1 computer nodes to tolerate f node failures ordinarily require that the computer nodes have strong failure semantics, i.e. a node should either produce correct results or no results at all. We show that this requirement can be relaxed for control applications, as control algorithms inherently compensate for a class of value failures. Value failures occur when an error escapes the error detection mechanisms in the computer node and an erroneous value is sent to the actuators of the control system. Fault injection experiments show that 89% of the value failures caused by bit flips in a CPU had no or minor impact on the controlled object. However, the experiments also show that 11% of the value failures had severe consequences. These failures were caused by bit flips affecting the state variables of the control algorithm. Another set of fault injection experiments showed that the percentage of value failures with severe consequences was reduced to 3% when the state variables were protected with executable assertions and best-effort recovery mechanisms.

[1]  P. M. Melliar-Smith,et al.  Byzantine clock synchronization , 1984, PODC '84.

[2]  Chris J. Walter,et al.  The MAFT Architecture for Distributed Fault Tolerance , 1988, IEEE Trans. Computers.

[3]  E. Normand Single event upset at ground level , 1996 .

[4]  G. C. Messenger,et al.  Collection of Charge on Junction Nodes from Ion Tracks , 1982, IEEE Transactions on Nuclear Science.

[5]  K. Johansson,et al.  In-flight and ground testing of single event upset sensitivity in static RAMs , 1997 .

[6]  Flaviu Cristian,et al.  Understanding fault-tolerant distributed systems , 1991, CACM.

[7]  J. Karlsson,et al.  Application of Three Physical Fault Injection Techniques to the Experimental Assessment of the MARS Architecture , 1995 .

[8]  Johan Karlsson,et al.  GOOFI: generic object-oriented fault injection tool , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings..

[9]  Johan Karlsson,et al.  A comparison of simulation based and scan chain implemented fault injection , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[10]  Johan Karlsson,et al.  GOOFI: generic object-oriented fault injection tool , 2001, 2001 International Conference on Dependable Systems and Networks.

[11]  J. von Neumann,et al.  Probabilistic Logic and the Synthesis of Reliable Organisms from Unreliable Components , 1956 .