On the Quality of Service of Failure Detectors Based on Control Theory

Failure detection is a fundamental issue for fault tolerance in distributed systems. It is increasingly recognized that failure detection should be provided as a generic service, similar to IP address lookup. This has not been successful so far, in part because classical failure detectors were not designed to satisfy several application requirements simultaneously. More specifically, traditional failure detector implementations are often tuned for local networks and fail to address important problems found in wide-area distributed systems with a large number of monitored components. In this paper, we study the quality of service (QoS) of failure detectors. We first present a novel failure detector scheme, combined with control theory, that helps address these problems. We then discuss the design and analysis of a scalable failure detection service for such large wide-area distributed systems, in which the heartbeat streams are adjusted dynamically to satisfy the requirements of the bottleneck router. We further show how the online failure detector control algorithm can be used to design a controller, analyze the theoretical aspects of the proposed algorithm, and verify its agreement. Simulation results show the efficiency of our scheme in terms of high utilization of the bottleneck link, fast response, and good stability of both the bottleneck router buffer occupancy and the controlled sending rates. In conclusion, the new failure detector algorithm provides better QoS.
