Failure detectors for large-scale distributed systems

This paper discusses the problem of implementing a scalable failure detection service for grid systems. Traditional failure detector implementations are often tuned for local networks and fail to address important problems that arise in wide-area distributed systems such as grids. We identify some of the most important problems raised in the context of grids, then survey recent proposals that can help solve some of them.
