Rigorous Development of an Embedded Fault-Tolerant System Based on Coordinated Atomic Actions

Describes our experience using coordinated atomic (CA) actions as a system structuring tool to design and validate a sophisticated and embedded control system for a complex industrial application that has high reliability and safety requirements. Our study is based on an extended production cell model, the specification and simulator for which were defined and developed by FZI (Forschungszentrum Informatik, Germany). This "fault-tolerant production cell" represents a manufacturing process involving redundant mechanical devices (provided in order to enable continued production in the presence of machine faults). The challenge posed by the model specification is to design a control system that maintains specified safety and liveness properties even in the presence of a large number and variety of device and sensor failures. Based on an analysis of such failures, we provide details of: (1) a design for a control program that uses CA actions to deal with both safety-related and fault tolerance concerns and (2) the formal verification of this design based on the use of model checking. We found that CA action structuring facilitated both the design and verification tasks by enabling the various safety problems (involving possible clashes of moving machinery) to be treated independently. Even complex situations involving the concurrent occurrence of any pairs of the many possible mechanical and sensor failures can be handled simply yet appropriately. The formal verification activity was performed in parallel with the design activity, and the interaction between them resulted in a combined exercise in "design for validation"; formal verification was very valuable in identifying some very subtle residual bugs in early versions of our design which would have been difficult to detect otherwise.

[1]  Elizabeth L. White,et al.  Application of dynamic reconfiguration in the design of fault tolerant production systems , 1998, Proceedings. Fourth International Conference on Configurable Distributed Systems (Cat. No.98EX159).

[2]  Claus Lewerentz,et al.  Formal Development of Reactive Systems , 1995, Lecture Notes in Computer Science.

[3]  Kenneth L. McMillan,et al.  Symbolic model checking , 1992 .

[4]  K. H. Kim,et al.  Approaches to Mechanization of the Conversation Scheme Based on Monitors , 1982, IEEE Transactions on Software Engineering.

[5]  Peter Liggesmeyer,et al.  Improving system reliability with automatic fault tree generation , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).

[6]  Cecília M. F. Rubira,et al.  Fault tolerance in concurrent object-oriented software through coordinated error recovery , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[7]  Brian Randell,et al.  Developing Control Software for Production Cell II: Failure Analysis and System Design Using CA Actions , 1998 .

[8]  Santosh K. Shrivastava,et al.  An overview of the Arjuna distributed programming system , 1991, IEEE Software.

[9]  Brian Randell System structure for software fault tolerance , 1975 .

[10]  Brian Randell,et al.  Coordinated exception handling in real-time distributed object systems , 1999 .

[11]  Claus Lewerentz,et al.  Formal Development of Reactive Systems: Case Study Production Cell , 1995 .

[12]  E. Allen Emerson,et al.  Temporal and Modal Logic , 1991, Handbook of Theoretical Computer Science, Volume B: Formal Models and Sematics.

[13]  Rogério de Lemos,et al.  Coordinated atomic actions in modelling object cooperation , 1998, Proceedings First International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98).

[14]  Santosh K. Shrivastava,et al.  Checked transactions in an asynchronous message passing environment , 1998, Proceedings First International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98).

[15]  Jim Gray,et al.  A census of Tandem system availability between 1985 and 1990 , 1990 .

[16]  Santosh K. Shrivastava,et al.  The Design and Implementation of Arjuna , 1995, Comput. Syst..

[17]  Avelino Francisco Zorzo,et al.  Rigorous development of a safety-critical system based on coordinated atomic actions , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[18]  Brian Randell,et al.  Using coordinated atomic actions to design safety‐critical systems: a production cell case study , 1999 .

[19]  Jie Xu,et al.  Concurrent Exception Handling and Resolution in Distributed Object Systems , 2000, IEEE Trans. Parallel Distributed Syst..

[20]  Brian Randell,et al.  Error recovery in asynchronous systems , 1986, IEEE Transactions on Software Engineering.

[21]  Jie Xu,et al.  Coordinated exception handling in distributed object systems: from model to system implementation , 1998, Proceedings. 18th International Conference on Distributed Computing Systems (Cat. No.98CB36183).