FORTRESS: A System to Support Fail-Aware Real-Time Applications

Fortress is a support system for designing and implementing fault-tolerant distributed real-time systems that use commercial off-the-shelf (COTS) components. The main problem we address in Fortress is that services cannot always provide their standard properties, because deadlines can be missed, messages can be dropped and processes can crash. Fortress allows clients to detect when a service can no longer provide its standard semantics due to unmasked failures. A service is fail-aware if it maintains an indicator that allows its clients to determine whether the service currently provides its standard semantics or some predefined exception semantics. Fortress provides fail-aware clock synchronization, membership and atomic broadcast services. These indicators allow a fail-safe application to switch the system to a safe state when not all failures can be masked.
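The notion of an indicator can be illustrated with a minimal sketch. It is not the Fortress API: the names `FailAwareClock`, `Semantics`, `max_sync_age` and `control_step` are hypothetical, and the rule that a clock counts as synchronized only if it was resynchronized recently enough is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Semantics(Enum):
    STANDARD = "standard"    # the service meets its normal specification
    EXCEPTION = "exception"  # the service has fallen back to its exception semantics


@dataclass
class FailAwareClock:
    """Hypothetical fail-aware clock service (illustrative only)."""
    max_sync_age: float      # assumed bound, in seconds, on time since the last resync
    last_sync: float = 0.0   # local time of the last successful resync

    def resync(self, now: float) -> None:
        # Record a successful synchronization round.
        self.last_sync = now

    def indicator(self, now: float) -> Semantics:
        # The indicator tells the client which semantics currently hold.
        if now - self.last_sync <= self.max_sync_age:
            return Semantics.STANDARD
        return Semantics.EXCEPTION


def control_step(clock: FailAwareClock, now: float) -> str:
    # A fail-safe client checks the indicator before relying on the service.
    if clock.indicator(now) is Semantics.STANDARD:
        return "operate"
    return "safe_state"
```

In this sketch the client never has to detect the failure itself; it only reads the indicator and moves to the safe state whenever the standard semantics are not guaranteed.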
