A fail-aware membership service

We propose a protocol for implementing a partitionable membership service in timed asynchronous systems. The protocol is fail-aware: a process p knows at all times whether its approximation of the set of processes in its partition is up-to-date or out-of-date. It guarantees that the memberships of two partitions never overlap. To minimize wrong suspicions, it gives a process a second chance to stay in the membership before removing it. Our measurements show that live processes are rarely excluded and that crash detection times are good.
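The abstract does not spell out the protocol's mechanics, so the following is only a toy sketch of the two ideas it names: a second chance before removal and a local up-to-date check. Everything in it is an assumption made for illustration and is not taken from the paper: the class `SecondChanceView`, the parameters `timeout` and `validity`, heartbeat-based suspicion, and a freshness test based on the age of the last local check. It also models a single process's local view only and does not attempt the agreement needed to keep partitions from overlapping.

```python
class SecondChanceView:
    """Toy local membership view; all names and parameters are hypothetical."""

    def __init__(self, members, timeout, validity):
        self.timeout = timeout      # longest silence tolerated before suspicion
        self.validity = validity    # longest age of a view still deemed up-to-date
        self.last_heard = {m: 0.0 for m in members}
        self.suspected = set()      # members currently on their second chance
        self.last_refresh = 0.0     # local time of the most recent check

    def heartbeat(self, member, now):
        """Record a message from a member and clear any suspicion of it."""
        if member in self.last_heard:
            self.last_heard[member] = now
            self.suspected.discard(member)

    def check(self, now):
        """Suspect silent members; remove those silent for a second period."""
        removed = []
        for member, heard in list(self.last_heard.items()):
            if now - heard <= self.timeout:
                continue
            if member in self.suspected:
                # Second consecutive miss: the second chance is used up.
                del self.last_heard[member]
                self.suspected.discard(member)
                removed.append(member)
            else:
                # First miss: suspect the member and restart its timer.
                self.suspected.add(member)
                self.last_heard[member] = now
        self.last_refresh = now
        return removed

    def is_up_to_date(self, now):
        """Assumed freshness rule: the view is recent enough to be trusted."""
        return now - self.last_refresh <= self.validity

    def members(self):
        return sorted(self.last_heard)
```

In this sketch a member that misses one timeout is only suspected, and a heartbeat arriving during the next period restores it. Only a second consecutive miss removes it, which is one simple way to trade a longer detection time for fewer wrong exclusions.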
