Beyond Fail-Stop: Wait-Free Serializability and Resiliency in the Presence of Slow-Down Failures

Abstract: Historically, database researchers have dealt with two kinds of process failures: fail-stop failures and malicious failures. Under the fail-stop assumption, processes fail by halting. Such failures are easily detectable. Under the malicious (or Byzantine) failure assumption, processes fail by behaving unpredictably, perhaps as adversaries. Such failures are not necessarily detectable. When system designers discuss fault tolerance, they typically restrict themselves to handling fail-stop failures only. This paper proposes an intermediate failure model and presents a practical algorithm for handling transactions under this model. The new failure model allows processes to fail by either slowing down or stopping: a slow process may later speed up, continue to proceed slowly, or (eventually) stop. We call such failures slow-down failures. The model does not assume the ability to distinguish among these possibilities, say, by using a timeout mechanism, nor does it assume that it is possible to kill a slow process. Instead, our algorithm allows a new process to be dispatched to do the job that had been assigned to a slow process. The problem is that several processes may then end up doing the same task and interfere with one another. Our algorithm controls such interference while guaranteeing both serializability and resiliency.
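The interference problem described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm, which the abstract does not spell out. It is a minimal illustration, assuming that each redundant process computes on private state and then tries to publish its result through a "first commit wins" step. The names `TaskSlot`, `try_commit`, `run_task`, and `demo` are hypothetical, and the lock stands in for an atomic compare-and-swap primitive, so the sketch itself is not wait-free.

```python
import threading


class TaskSlot:
    """Hypothetical holder for one task's result; only the first commit takes effect."""

    def __init__(self):
        # Assumption: the lock models an atomic compare-and-swap.
        # A real wait-free design would not rely on a lock here.
        self._lock = threading.Lock()
        self._done = False
        self.result = None

    def try_commit(self, value):
        """Publish value if nobody has yet; return True only for the first committer."""
        with self._lock:
            if self._done:
                return False
            self._done = True
            self.result = value
            return True


def run_task(slot, compute, winners, worker_id):
    # The work is done on private state, so duplicate workers cannot
    # interfere with one another until the commit step.
    value = compute()
    if slot.try_commit(value):
        winners.append(worker_id)


def demo():
    slot = TaskSlot()
    winners = []
    # Three processes are dispatched for the same task, as if the
    # first two had been suspected of being slow.
    workers = [
        threading.Thread(target=run_task, args=(slot, lambda: 6 * 7, winners, i))
        for i in range(3)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return slot.result, len(winners)


if __name__ == "__main__":
    result, committed = demo()
    print(result, committed)
```

However many duplicates run, and however slowly, exactly one commit takes effect, so the task appears to have executed once. A slow duplicate that finishes late simply fails its commit and has no visible effect.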
