Group Communication in Partitionable Distributed Systems

We give a formal specification and an implementation for a partitionable group communication service in asynchronous distributed systems. Our specification is motivated by the requirements for building "partition-aware" applications that can continue operating without blocking in multiple concurrent partitions and reconfigure themselves dynamically when partitions merge. The specified service guarantees liveness and excludes trivial solutions; it constitutes a useful basis for building realistic partition-aware applications; and it is implementable in practical asynchronous distributed systems where certain stability conditions hold.
