A fault-tolerant algorithm for decentralized on-line quorum adaptation

A quorum-based distributed mutual exclusion protocol requires each processor in a distributed system to obtain permission from a quorum of processors before accessing a resource that cannot be shared concurrently. To prevent failed quorum members from blocking access to the resource, failed processors should be removed from quorums once their failures are detected. This work addresses the problem of adapting quorums on-line, while a quorum-based mutual exclusion protocol continues to operate. Mutual exclusion safety depends on the quorum intersection property (every two quorums share at least one processor), so changes made to the quorum data structures of different processors must be coordinated. A solution is given in the form of QADAPT, a decentralized algorithm that guarantees safe adaptation of quorums when processors fail. QADAPT permits any set of quorum adaptations that does not violate the quorum intersection property, and allows any set of faulty processors to be removed from quorums. QADAPT has optimal message passing cost and tolerates any number of processor (halting) failures. The assumed distributed system model provides only point-to-point messages, with no guarantee on message ordering. Results from an implementation show that the algorithm's execution time scales well in a system containing up to fifty networked workstations. Extensions of this work include on-line adaptation of quorums that are used to maintain replica consistency in distributed databases.
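The quorum intersection property mentioned above can be illustrated with a small sketch. This is not QADAPT itself, which is a decentralized message-passing algorithm; it is only a centralized check, under the assumption that quorums are modeled as sets of processor identifiers. The function names `intersects_pairwise`, `remove_failed`, and `safe_to_remove` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def intersects_pairwise(quorums):
    """Return True if every pair of quorums shares at least one processor."""
    return all(q1 & q2 for q1, q2 in combinations(quorums, 2))


def remove_failed(quorums, failed):
    """Drop the failed processors from every quorum (naive removal, no coordination)."""
    return [q - failed for q in quorums]


def safe_to_remove(quorums, failed):
    """Check whether naive removal leaves non-empty, pairwise intersecting quorums."""
    adapted = remove_failed(quorums, failed)
    return all(adapted) and intersects_pairwise(adapted)


# Hypothetical system of four processors with three quorums.
quorums = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({1, 3, 4})]

print(intersects_pairwise(quorums))      # the original quorums intersect
print(safe_to_remove(quorums, {4}))      # removing processor 4 keeps intersection
print(safe_to_remove(quorums, {3, 4}))   # removing 3 and 4 leaves {2} and {1} disjoint
```

In the last case, simply deleting the failed processors breaks intersection, so a safe adaptation would have to change the quorums in some other way. Deciding on such changes consistently across processors, without a central checker, is the coordination problem the abstract describes.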
