A group membership service for large-scale grids

In this paper, we propose a decentralized group membership service that can be incorporated into existing grid middleware to make it more reliable. This service includes a flexible failure detector that adapts dynamically to changing network conditions and can be configured with a number of failure recovery strategies. Moreover, it disseminates information about membership changes (new processes, failures, etc.) in a scalable and efficient manner. We conducted a preliminary evaluation of the proposed service by simulating a grid with up to 140 nodes distributed across three domains separated by a wide-area network. This evaluation showed that the proposed service performs well both in the absence and in the presence of process failures.
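To make the idea of a failure detector that adapts to changing network conditions concrete, the following is a minimal sketch and not the detector proposed in this paper. It assumes a heartbeat-based design in which the timeout is re-estimated as the mean of recent heartbeat inter-arrival times plus a multiple of their standard deviation. The class name `AdaptiveFailureDetector` and the parameters `window`, `safety_factor`, and `initial_timeout` are hypothetical and chosen only for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveFailureDetector:
    """Illustrative heartbeat-based detector (an assumption, not the paper's design).

    The timeout is re-estimated from recent heartbeat inter-arrival times,
    so it grows when the network becomes slower or more jittery.
    """

    def __init__(self, window=100, safety_factor=4.0, initial_timeout=1.0):
        # Sliding window of recent inter-arrival times, in seconds.
        self.intervals = deque(maxlen=window)
        # Hypothetical tuning knob: how many standard deviations of slack to allow.
        self.safety_factor = safety_factor
        # Timeout used until at least two inter-arrival samples are available.
        self.initial_timeout = initial_timeout
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def timeout(self):
        """Current timeout: mean inter-arrival time plus a jitter margin."""
        if len(self.intervals) < 2:
            return self.initial_timeout
        return mean(self.intervals) + self.safety_factor * pstdev(self.intervals)

    def suspects(self, now):
        """Return True if the monitored process is overdue at time `now`."""
        if self.last_arrival is None:
            return False
        return now - self.last_arrival > self.timeout()


if __name__ == "__main__":
    fd = AdaptiveFailureDetector()
    for t in [0.0, 1.0, 3.0, 4.0, 6.0]:  # jittery heartbeats
        fd.heartbeat(t)
    print(fd.timeout(), fd.suspects(9.0), fd.suspects(10.0))
```

With perfectly regular heartbeats the margin shrinks to zero and the timeout equals the heartbeat period. With jittery heartbeats the timeout widens, which trades slower detection for fewer false suspicions. A real detector would also need a recovery strategy once a suspicion is raised, which this sketch leaves out.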
