The guaranteed response paradigm is currently favored by some parts of the research community for designing distributed hard real-time systems. It uses peak load and bounded failure rate assumptions to guarantee that the real-time system reacts to events of the controlled object within an a priori known time. Much of the available off-the-shelf hardware and software makes it very hard to guarantee the bounded failure rate hypothesis at run-time. If this hypothesis, which is fundamental to all synchronous service implementations, is violated at run-time, these implementations may behave unpredictably. We address this problem by proposing fail-awareness, an approach intended to support the construction of fail-safe or complex real-time systems. Fail-awareness is a systematic approach that masks all failures whenever the failure rate is within a given bound; when not all failures can be masked, fail-aware services must provide well-defined exception semantics.
The goal of fail-awareness is as follows: as long as the underlying communication and process services are affected only by a bounded rate of failures, every service must provide its standard synchronous semantics, i.e. the system reacts within the given time, and each service knows that it is providing these semantics. Each server maintains an exception indicator that tells its clients whether it currently provides its standard semantics or its exception semantics. When the failure rate rises above an a priori given threshold, a server may switch to its exception semantics, but it must notify its clients that it has done so. An application can use the indicators to switch to a safe state when failures cannot be masked.
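The indicator mechanism described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the class and parameter names (`FailAwareServer`, `max_failures`, `window`) are hypothetical, and it assumes, as a simplification, that the server can count the failures it observes within a sliding time window.

```python
from collections import deque
from enum import Enum


class Semantics(Enum):
    STANDARD = "standard"
    EXCEPTION = "exception"


class FailAwareServer:
    """Toy server exposing an exception indicator (illustrative assumption)."""

    def __init__(self, max_failures: int, window: float):
        self.max_failures = max_failures  # hypothetical bound on failures per window
        self.window = window              # hypothetical window length in seconds
        self._failures = deque()          # timestamps of recently observed failures

    def record_failure(self, now: float) -> None:
        """Note a failure observed at time `now`."""
        self._failures.append(now)

    def indicator(self, now: float) -> Semantics:
        """Report which semantics the server provides at time `now`."""
        # Forget failures that have fallen out of the observation window.
        while self._failures and self._failures[0] <= now - self.window:
            self._failures.popleft()
        if len(self._failures) <= self.max_failures:
            return Semantics.STANDARD
        return Semantics.EXCEPTION


def client_step(server: FailAwareServer, now: float) -> str:
    """Client policy: operate under standard semantics, otherwise go safe."""
    if server.indicator(now) is Semantics.STANDARD:
        return "operate"
    return "safe_state"
```

In this sketch the client never has to guess whether the timing guarantee still holds; it reads the indicator and moves to a safe state as soon as the server reports exception semantics.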
We show how fail-awareness can be applied in partitionable systems, i.e. systems in which communication cannot be guaranteed because of network failures or excessive performance failures. Our approach allows the servers in each partition to make progress independently of the servers in other partitions. When a server provides its standard semantics, its indicator tells its clients which logical partition they belong to. Otherwise, it signals only that the server provides its exception semantics. We describe several fail-aware partitionable services to show the applicability of our approach.
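A rough sketch of a partition-aware indicator follows. It only illustrates the interface suggested by the paragraph above; the names `LogicalPartition` and `PartitionAwareServer` are hypothetical, and how servers agree on partition membership is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class LogicalPartition:
    ident: int               # hypothetical partition identifier
    members: FrozenSet[str]  # servers agreed to be in this partition


class PartitionAwareServer:
    """Toy server whose indicator names its logical partition (assumption)."""

    def __init__(self, name: str):
        self.name = name
        # None means the server currently provides its exception semantics.
        self._partition: Optional[LogicalPartition] = None
        self.applied: List[str] = []  # updates applied under standard semantics

    def join(self, partition: LogicalPartition) -> None:
        """Install a partition; membership agreement is assumed external."""
        if self.name not in partition.members:
            raise ValueError("server is not a member of this partition")
        self._partition = partition

    def leave(self) -> None:
        """Switch to exception semantics, e.g. after non-maskable failures."""
        self._partition = None

    def indicator(self) -> Optional[LogicalPartition]:
        """Return the current logical partition, or None for exception semantics."""
        return self._partition

    def apply(self, update: str) -> bool:
        """Make progress only while the server provides standard semantics."""
        if self._partition is None:
            return False
        self.applied.append(update)
        return True
```

Two servers that have joined different partitions can each call `apply` successfully without contacting the other, which mirrors the independent progress described above; a server that has called `leave` rejects updates until it joins a partition again.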