Separating recovery strategies from application functionality: experiences with a framework approach

Industry-oriented fault tolerance solutions for embedded distributed systems should be based on adaptable, reusable elements. Software-implemented fault tolerance can provide such flexibility through the framework approach presented here. The framework consists of (1) a library of fault tolerance functions, (2) a backbone that coordinates these functions, and (3) a language for expressing configuration and recovery. This language acts as an ancillary application layer that separates recovery aspects from functional ones. Such a framework allows the available hardware redundancy to be combined flexibly with software-implemented fault tolerance. This increases the availability and reliability of the application at a justifiable cost, because the library elements can be reused across different target systems. It also increases maintainability: because the functional behavior is separated from the recovery strategies executed when an error is detected, modifications to functional and nonfunctional behavior are, to some extent, independent and hence less complex. Practical experience is reported from integrating this framework into an automation system for electricity distribution. This case study illustrates how software-based fault tolerance solutions and the configuration-and-recovery language ARIEL provide flexibility and adaptability to changes in the environment.
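
The three-part structure described above can be illustrated with a minimal sketch. It is not ARIEL syntax and not the framework's actual API: the names `Backbone`, `watchdog`, `restart`, `alert`, `RECOVERY`, and the task name `meter_reader` are all hypothetical, chosen only to show how a recovery table might be kept apart from functional code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical illustration only: none of these names come from the framework.

@dataclass
class Backbone:
    """(2) Coordinator: receives error notifications and runs configured recovery."""
    strategies: Dict[str, List[Callable[[str], str]]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def notify(self, error_kind: str, task: str) -> None:
        # Look up the recovery actions configured for this kind of error
        # and record what each one did.
        for action in self.strategies.get(error_kind, []):
            self.log.append(action(task))


def watchdog(backbone: Backbone, task: str, heartbeat_age: float, timeout: float) -> bool:
    """(1) Library element: a reusable detector that only reports, never recovers."""
    if heartbeat_age > timeout:
        backbone.notify("timeout", task)
        return False
    return True


def restart(task: str) -> str:
    return f"restart {task}"


def alert(task: str) -> str:
    return f"alert operator about {task}"


# (3) Recovery configuration, kept separate from the functional code.
# Changing the strategy means editing this table, not the application.
RECOVERY = {"timeout": [restart, alert]}

backbone = Backbone(strategies=RECOVERY)
healthy = watchdog(backbone, "meter_reader", heartbeat_age=5.0, timeout=2.0)
print(healthy, backbone.log)
```

In this sketch the detector and the application never name a recovery action; swapping `RECOVERY` for a different table changes the reaction to a timeout without touching either, which is the kind of independence between functional and recovery changes that the abstract attributes to the approach.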
