A tunable protocol for symmetric surveillance in distributed systems

In distributed systems, surveillance protocols are used to monitor the status of remote sites. A remote site is regarded as available as long as messages are received from it; otherwise it is regarded as unavailable. If a site becomes unavailable, this is reported to other sites and recovery actions can be initiated. Using an example, it is shown that in certain cases it is necessary that, whenever some site S1 detects the unavailability of some other site S2, S2 also detects the unavailability of S1 within a fixed amount of time. Unfortunately, this cannot be guaranteed by existing surveillance protocols. Another problem with existing protocols is that remote sites are usually reported as unavailable after being timed out only once, i.e. the loss of a single message might cause complete systems to back out. Two versions of a protocol for so-called symmetric surveillance are presented. Both guarantee that if S1 detects the unavailability of S2 at time t0, then S2 (provided that S2 has not crashed) will become aware of this fact at some time t1 such that |t1 - t0| < Δ. This property is of special interest for handling network partitions. Additionally, one of the versions is tunable, i.e. it can be specified how many timeouts may occur before a site is regarded as unavailable.
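
The tunable aspect can be illustrated with a minimal sketch. It is not the protocol presented in the paper, and it does not provide the symmetric guarantee. It only shows one way a site could tolerate several consecutive timeouts before declaring a remote site unavailable. The class `SiteMonitor`, its fields, and the interpretation of `max_timeouts` (the number of consecutive timeouts at which the site is declared unavailable) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteMonitor:
    """Hypothetical local view of one remote site (illustration only)."""
    timeout: float        # assumed: longest tolerated silence per interval
    max_timeouts: int     # assumed tuning knob: consecutive timeouts allowed
    last_heard: float = 0.0
    missed: int = 0
    available: bool = True

    def on_message(self, now: float) -> None:
        # Any message from the remote site resets the timeout counter.
        # Once a site is declared unavailable, this sketch keeps it so;
        # recovery is outside its scope.
        if self.available:
            self.last_heard = now
            self.missed = 0

    def tick(self, now: float) -> bool:
        # Count every full timeout interval that elapsed without a message.
        while self.available and now - self.last_heard >= self.timeout:
            self.missed += 1
            self.last_heard += self.timeout
            if self.missed >= self.max_timeouts:
                self.available = False
        return self.available


if __name__ == "__main__":
    monitor = SiteMonitor(timeout=1.0, max_timeouts=3)
    monitor.on_message(0.0)
    for t in (1.0, 2.0, 3.0):
        print(t, monitor.tick(t))
```

With `max_timeouts=1` the sketch behaves like the existing protocols criticized above, where a single lost message is enough to report a site as unavailable.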