Paper information - A group membership service for large-scale grids

In this paper, we propose a decentralized group membership service that can be incorporated into existing grid middleware to make it more reliable. This service includes a flexible failure detector that adapts dynamically to changing network conditions and can be configured with a number of failure recovery strategies. Moreover, it disseminates information about membership changes (new processes, failures, etc.) in a scalable and efficient manner. We conducted a preliminary evaluation of the proposed service by simulating a grid with up to 140 nodes distributed across three domains separated by a wide-area network. This evaluation showed that the proposed service performs well both in the absence and in the presence of process failures.
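The abstract does not say how the failure detector adapts to network conditions. As an illustration only, the sketch below shows one common way a heartbeat-based detector can adjust its timeout: it keeps a running estimate of the heartbeat inter-arrival time and its deviation, and suspects a process once the silence exceeds the estimate plus a safety margin. The class name `AdaptiveFailureDetector`, its parameters, and the smoothing constants are assumptions made for this example and are not taken from the paper.

```python
class AdaptiveFailureDetector:
    """Hypothetical heartbeat detector whose timeout tracks observed delays.

    Illustrative sketch only; this is not the algorithm from the paper.
    """

    def __init__(self, initial_interval=1.0, alpha=0.125, beta=0.25,
                 safety_factor=4.0):
        self.mean = initial_interval    # estimated heartbeat inter-arrival time
        self.dev = 0.0                  # estimated deviation of that time
        self.alpha = alpha              # smoothing weight for the mean
        self.beta = beta                # smoothing weight for the deviation
        self.safety_factor = safety_factor
        self.last_arrival = None        # time of the most recent heartbeat

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            sample = now - self.last_arrival
            error = sample - self.mean
            # Exponentially weighted updates of mean and deviation.
            self.mean += self.alpha * error
            self.dev += self.beta * (abs(error) - self.dev)
        self.last_arrival = now

    def timeout(self):
        """Current silence threshold: mean plus a margin scaled by deviation."""
        return self.mean + self.safety_factor * self.dev

    def suspects(self, now):
        """Return True if the monitored process should be suspected at `now`."""
        if self.last_arrival is None:
            return False  # nothing observed yet, so no basis for suspicion
        return now - self.last_arrival > self.timeout()
```

With regular heartbeats the timeout stays close to the heartbeat interval; when a heartbeat arrives late, the deviation term grows and the timeout widens, which trades slower detection for fewer false suspicions on a slow wide-area link.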
