Designing masking fault-tolerance via nonmasking fault-tolerance

Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. By contrast, nonmasking fault-tolerance guarantees less: only that, once faults stop occurring, program executions converge to states from which programs continually (re)satisfy their specification. In this paper, we show that a practical method for designing masking fault-tolerance is to first design nonmasking fault-tolerance and then transform the nonmasking fault-tolerant program minimally so as to achieve masking fault-tolerance. We demonstrate this method by designing novel, fully distributed programs for termination detection, mutual exclusion, and leader election that are masking tolerant of any finite number of process fail-stops and/or repairs.
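
The distinction between the two tolerance classes can be illustrated on a toy finite-state model. The sketch below is not taken from the paper: the state encoding, the names `reachable`, `converges`, and `classify`, and the modelling of the safety part of the specification as a set of forbidden program steps (`unsafe_steps`) are all simplifying assumptions made for illustration. It also assumes the invariant is closed under program steps. Under those assumptions, a program is classified as nonmasking if every fault-free execution from a fault-reachable state returns to the invariant, and as masking if, in addition, it never takes a forbidden step along the way.

```python
from typing import Dict, Set, Tuple

State = int
Relation = Dict[State, Set[State]]  # state -> set of successor states


def reachable(start: Set[State], *relations: Relation) -> Set[State]:
    """States reachable from `start` using any of the given transition relations."""
    seen = set(start)
    frontier = list(start)
    while frontier:
        s = frontier.pop()
        for rel in relations:
            for t in rel.get(s, set()):
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen


def converges(span: Set[State], invariant: Set[State], program: Relation) -> bool:
    """True if every program-only execution from `span` reaches `invariant`.

    Grows the set of states from which all program steps lead into the set;
    a state outside the invariant with no program step counts as a failure.
    """
    good = set(invariant)
    changed = True
    while changed:
        changed = False
        for s in span - good:
            successors = program.get(s, set())
            if successors and successors <= good:
                good.add(s)
                changed = True
    return span <= good


def classify(invariant: Set[State], program: Relation, faults: Relation,
             unsafe_steps: Set[Tuple[State, State]]) -> str:
    """Classify a toy program as 'masking', 'nonmasking', or 'intolerant'."""
    # Fault span: everything reachable from the invariant via program or fault steps.
    span = reachable(invariant, program, faults)
    if not converges(span, invariant, program):
        return "intolerant"
    takes_unsafe_step = any((s, t) in unsafe_steps
                            for s in span for t in program.get(s, set()))
    return "nonmasking" if takes_unsafe_step else "masking"


# Hypothetical example: state 0 is the invariant, a fault perturbs 0 to 1,
# and the program step 1 -> 2 is assumed to violate safety.
invariant = {0}
faults = {0: {1}}
unsafe_steps = {(1, 2)}

nonmasking_program = {0: {0}, 1: {2}, 2: {0}}  # recovers, but via the unsafe step
masking_program = {0: {0}, 1: {0}}             # recovers without the unsafe step
intolerant_program = {0: {0}}                  # never leaves the perturbed state

print(classify(invariant, nonmasking_program, faults, unsafe_steps))  # nonmasking
print(classify(invariant, masking_program, faults, unsafe_steps))     # masking
print(classify(invariant, intolerant_program, faults, unsafe_steps))  # intolerant
```

In this toy setting, `masking_program` differs from `nonmasking_program` only in that the unsafe recovery path is replaced by a safe one, which loosely mirrors the idea of transforming a nonmasking program minimally to obtain a masking one.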
