Skip Ring Topology in FAST Failure Detection Service

This paper addresses the problem of communication among loosely coupled groups of nodes in distributed systems. We propose a novel logical communication topology based on the skip list data structure, and we enhance this structure to make it more resilient to failures. Extensive simulation experiments show that it has good self-stabilization characteristics. We present the concept in the context of our failure detection service, where it is used at the local communication level.
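
The abstract does not spell out how the topology is built, so the following is only a minimal sketch of one plausible reading: each node draws a random height as in a skip list, and the nodes that reach a given level are linked into a ring at that level. The names `assign_levels`, `build_skip_ring`, and `remove_node`, and the parameters `p` and `max_level`, are assumptions made for illustration and are not taken from the paper.

```python
import random


def assign_levels(node_ids, p=0.5, max_level=4, seed=0):
    """Give each node a random height, drawn as in a skip list (assumed scheme)."""
    rng = random.Random(seed)
    levels = {}
    for node in node_ids:
        height = 1
        # Promote the node one level at a time with probability p.
        while height < max_level and rng.random() < p:
            height += 1
        levels[node] = height
    return levels


def build_skip_ring(node_ids, levels):
    """Return a list of rings; rings[i][node] is the node's successor at level i + 1."""
    ordered = sorted(node_ids)
    top = max(levels.values())
    rings = []
    for level in range(1, top + 1):
        # Only nodes tall enough take part in this level's ring.
        members = [n for n in ordered if levels[n] >= level]
        ring = {
            n: members[(i + 1) % len(members)]  # wrap around to close the ring
            for i, n in enumerate(members)
        }
        rings.append(ring)
    return rings


def remove_node(rings, failed):
    """Splice a failed node out of every ring it belongs to (hypothetical repair step)."""
    for ring in rings:
        if failed not in ring:
            continue
        successor = ring.pop(failed)
        for node, nxt in ring.items():
            if nxt == failed:
                ring[node] = successor  # the predecessor now skips the failed node


if __name__ == "__main__":
    nodes = list(range(8))
    rings = build_skip_ring(nodes, assign_levels(nodes))
    remove_node(rings, 3)
    for i, ring in enumerate(rings, start=1):
        print(f"level {i}: {ring}")
```

In this sketch the lowest ring contains every node, while higher rings are sparser and act as shortcuts. A failure at one level can then be bypassed by splicing the predecessor to the failed node's successor, which is one simple way to picture the resilience the abstract refers to.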
