In distributed systems, surveillance protocols are used to monitor the status of remote sites. A remote site is regarded as available as long as messages are received from it; otherwise it is regarded as unavailable. If a site becomes unavailable, this is reported to other sites so that recovery actions can be initiated. An example shows that in certain cases it is necessary that, whenever some site S1 detects the unavailability of some other site S2, S2 must also detect the unavailability of S1 within a fixed amount of time. Unfortunately, existing surveillance protocols cannot guarantee this. Another problem with existing protocols is that remote sites are usually reported as unavailable after timing out only once, i.e. the loss of just one message might cause complete systems to back out.
Two versions of a protocol for so-called symmetric surveillance are presented. Both guarantee that if S1 detects the unavailability of S2 at time t0, then S2 (provided that it has not crashed) will become aware of this fact at some time t1 such that |t1 - t0| < Δ. This property is of special interest for handling network partitions. Additionally, one of the versions is tunable, i.e. it can be specified how many timeouts may occur before a site is regarded as unavailable.
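The tunable behaviour can be illustrated with a minimal sketch. It is not the protocol presented here and does not implement the symmetric guarantee; it only shows the idea of tolerating a configurable number of consecutive timeouts before a site is reported as unavailable. The class `Monitor` and its fields `timeout`, `max_timeouts`, `last_heard` and `missed` are hypothetical names chosen for illustration, and time is passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Tracks one remote site (hypothetical illustration, not the paper's protocol)."""
    timeout: float           # length of silence that counts as one timeout
    max_timeouts: int        # number of consecutive timeouts at which the site is reported
    last_heard: float = 0.0  # time of the last message or last counted timeout
    missed: int = 0          # consecutive timeouts counted so far

    def on_message(self, now: float) -> None:
        # Any message from the remote site resets the timeout counter.
        self.last_heard = now
        self.missed = 0

    def check(self, now: float) -> bool:
        """Return True while the remote site is still regarded as available."""
        # Count every full timeout interval that elapsed without a message.
        while now - self.last_heard >= self.timeout:
            self.last_heard += self.timeout
            self.missed += 1
        return self.missed < self.max_timeouts


# Usage: tolerate two lost messages, report the site on the third timeout.
m = Monitor(timeout=1.0, max_timeouts=3)
m.on_message(0.0)
still_up_after_two = m.check(2.5)   # two timeouts so far
m.on_message(3.0)                   # a message arrives and resets the counter
down_after_three = not m.check(6.0) # three consecutive timeouts since 3.0
```

Setting `max_timeouts=1` in this sketch corresponds to the behaviour criticised above, where a single timeout is enough to report a site as unavailable.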