Fault-containing self-stabilizing algorithms

Self-stabilization provides a non-masking approach to fault tolerance. Given this fact, one would hope that in a self-stabilizing system, the amount of disruption caused by a fault is proportional to the severity of the fault. However, this is not true for many self-stabilizing systems. Our paper addresses this weakness of distributed self-stabilizing systems by introducing the notion of fault containment. Informally, a fault-containing self-stabilizing algorithm is one that contains the effects of limited transient faults while retaining the property of self-stabilization. The paper begins with a formal framework for specifying and evaluating fault-containing self-stabilizing protocols. Then, it is shown that self-stabilization and fault containment are goals that can conflict. For example, it is shown that imposing an O(1) bound on the worst-case recovery time from a 1-faulty state necessitates added overhead for stabilization: for some tasks, the O(1) recovery time implies that stabilization time cannot be within O(1) rounds of the optimum value. The paper then presents a transformer T that maps any non-reactive self-stabilizing algorithm P into an equivalent fault-containing self-stabilizing algorithm Pf that can repair any 1-faulty state in O(1) time with O(1) space overhead. This transformation is based on a novel stabilizing timer paradigm that significantly simplifies the task of fault containment. The paper concludes by generalizing the transformer T into a parameterized transformer T(k) such that for varying k we obtain varying performance measures for Pf.