FUSE: Lightweight Guaranteed Distributed Failure Notification

FUSE is a lightweight failure notification service for building distributed systems. Distributed systems built with FUSE are guaranteed that failure notifications never fail. Whenever a failure notification is triggered, all live members of the FUSE group will hear a notification within a bounded period of time, irrespective of node or communication failures. In contrast to previous work on failure detection, the responsibility for deciding that a failure has occurred is shared between the FUSE service and the distributed application. This allows applications to implement their own definitions of failure. Our experience building a scalable distributed event delivery system on an overlay network has convinced us of the usefulness of this service. Our results demonstrate that the network costs of each FUSE group can be small; in particular, our overlay network implementation requires no additional liveness-verifying ping traffic beyond that already needed to maintain the overlay, making the steady-state network load independent of the number of active FUSE groups.
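The group semantics described above can be illustrated with a minimal single-process sketch. It is not the FUSE implementation or its API: the class `FuseGroup` and the methods `register_handler`, `crash`, and `signal_failure` are hypothetical names chosen for illustration, and the model omits the network entirely, so it says nothing about how the bounded delivery time is achieved. It shows only two properties from the abstract: a notification, once triggered, reaches every live member exactly once, and either the service (here, a simulated node crash) or the application can trigger it.

```python
from typing import Callable, Dict, List, Set


class FuseGroup:
    """Toy in-process model of one notification group (hypothetical API)."""

    def __init__(self, members: List[str]) -> None:
        # One list of failure handlers per member.
        self.handlers: Dict[str, List[Callable[[str], None]]] = {
            m: [] for m in members
        }
        self.live: Set[str] = set(members)
        self.failed = False

    def register_handler(self, member: str, handler: Callable[[str], None]) -> None:
        """Register a callback that runs when the group fails."""
        self.handlers[member].append(handler)

    def crash(self, member: str) -> None:
        """Service-detected failure: the member stops, so the group fails."""
        self.live.discard(member)
        self.signal_failure(f"node {member} unreachable")

    def signal_failure(self, reason: str) -> None:
        """Application- or service-triggered failure; delivered at most once."""
        if self.failed:
            return  # the group has already failed; later signals are ignored
        self.failed = True
        for member in sorted(self.live):
            for handler in self.handlers[member]:
                handler(reason)


# Usage: three members, one crashes, the survivors are notified once.
heard = []
group = FuseGroup(["a", "b", "c"])
for name in ["a", "b", "c"]:
    group.register_handler(name, lambda reason, name=name: heard.append((name, reason)))

group.crash("c")                                   # service decides the group failed
group.signal_failure("application-level timeout")  # ignored: already failed
print(heard)
```

Because `signal_failure` is an ordinary method, the application can call it under whatever condition it regards as a failure, which is the shared-responsibility point made in the abstract.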
