A Failure Detection Service for Internet-Based Multi-AS Distributed Systems

Failure detectors are one of the basic building blocks of fault-tolerant distributed systems. A failure detector is a distributed oracle that provides information about the state of the processes in a distributed system. This work presents a failure detection service for Internet-based distributed systems that span multiple autonomous systems. The service is built on monitors that provide global process state information through an SNMP interface. A monitor runs on each network that hosts monitored processes, and monitors on different networks communicate across the Internet using Web Services. The system was implemented and evaluated with monitored processes running both on a single LAN and distributed around the world on PlanetLab. Experimental results are presented for CPU usage, failure detection latency, and mistake rate.
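The monitor arrangement described above can be illustrated with a small sketch. It is not the service's implementation: the abstract does not say how monitors decide that a process has failed, so the heartbeat-and-timeout rule, the `Monitor` class, its method names, and the `merge` step (standing in for the exchange between monitors on different networks) are all assumptions made for illustration. The SNMP and Web Services layers are omitted, and timestamps are passed in explicitly on the assumption that they are comparable across monitors.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Monitor:
    """Hypothetical per-network monitor (illustrative only).

    Assumption: a process is suspected once no heartbeat has been
    recorded for more than `timeout` seconds.
    """
    timeout: float
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, process_id: str, now: float) -> None:
        # Record the time at which a local process was last seen alive.
        self.last_heartbeat[process_id] = now

    def merge(self, remote: Dict[str, float]) -> None:
        # Stand-in for inter-monitor communication: keep the freshest
        # timestamp known for each process, local or remote.
        for pid, ts in remote.items():
            if ts > self.last_heartbeat.get(pid, float("-inf")):
                self.last_heartbeat[pid] = ts

    def state(self, process_id: str, now: float) -> str:
        # Answer a query about one process, as the oracle would.
        last = self.last_heartbeat.get(process_id)
        if last is None:
            return "unknown"
        return "trusted" if now - last <= self.timeout else "suspected"


# Two monitors on different networks share what they know.
lan_a = Monitor(timeout=2.0)
lan_b = Monitor(timeout=2.0)
lan_a.heartbeat("p1", now=0.0)
lan_b.heartbeat("p2", now=1.0)
lan_a.merge(lan_b.last_heartbeat)

print(lan_a.state("p1", now=1.5))  # trusted
print(lan_a.state("p2", now=4.0))  # suspected
```

In this sketch a shorter `timeout` would lower detection latency at the cost of more wrong suspicions, which is the trade-off that the latency and mistake-rate measurements quantify.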
