A fault detection service for wide area distributed computations

The potential for faults in distributed computing systems significantly complicates application development. Although a variety of techniques exist for detecting and correcting faults, implementing them in a particular context can be difficult. We therefore propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. The service uses well-known techniques based on unreliable fault detectors to detect and report component failures, and it lets the user trade off timeliness of reporting against the false positive rate. We describe the architecture of the service, report experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and serving as part of the NetSolve network-enabled numerical solver.
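The trade-off between timeliness and false positives can be illustrated with a minimal sketch of a timeout-based heartbeat monitor. This is not the service's actual design or API: the class `HeartbeatMonitor`, the helper `replay`, the component names, and the heartbeat trace are all hypothetical, chosen only to show how one timeout parameter governs both properties.

```python
class HeartbeatMonitor:
    """Illustrative timeout-based unreliable failure detector (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated before suspecting
        self.last_seen = {}      # component name -> arrival time of last heartbeat

    def heartbeat(self, component, now):
        """Record that a heartbeat from `component` arrived at time `now`."""
        self.last_seen[component] = now

    def suspected(self, now):
        """Return the components that have been silent longer than the timeout."""
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)


def replay(timeout, heartbeats, check_times):
    """Feed a recorded trace of (arrival_time, component) pairs to a monitor
    and report which components are suspected at each check time."""
    monitor = HeartbeatMonitor(timeout)
    pending = sorted(heartbeats)
    report = {}
    for t in sorted(check_times):
        # Deliver every heartbeat that arrived up to and including time t.
        while pending and pending[0][0] <= t:
            arrival, component = pending.pop(0)
            monitor.heartbeat(component, arrival)
        report[t] = monitor.suspected(t)
    return report


# Assumed trace: "a" is healthy, "b" is healthy but has heartbeats delayed
# between t=1 and t=4, and "c" crashes after t=2.
HEARTBEATS = (
    [(float(t), "a") for t in range(9)]
    + [(0.0, "b"), (1.0, "b"), (4.0, "b")]
    + [(float(t), "b") for t in range(5, 9)]
    + [(0.0, "c"), (1.0, "c"), (2.0, "c")]
)

if __name__ == "__main__":
    for timeout in (2.0, 5.0):
        print(timeout, replay(timeout, HEARTBEATS, [3.5, 8.5]))
```

On this trace, the 2-second timeout wrongly suspects the delayed component "b" at t=3.5, while the 5-second timeout avoids that false positive at the cost of taking longer to suspect the crashed component "c".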
