On the quality of service of failure detectors based on control theory

Failure detection is a fundamental issue for fault tolerance in distributed systems. It is now widely recognized that failure detection ought to be provided as a generic service, similar to IP address lookup. This has not succeeded so far, one reason being that classical failure detectors were not designed to satisfy several application requirements simultaneously. More specifically, traditional implementations of failure detectors are often tuned for local networks and fail to address important problems found in wide-area distributed systems with a large number of monitored components. In this paper, we study the quality of service (QoS) of failure detectors. We first present a novel failure detector scheme, combined with control theory, that helps to solve or mitigate these problems. We then discuss the design and analysis of a scalable failure detection service for such large wide-area distributed systems, in which the heartbeat streams are adjusted dynamically so that they satisfy the requirements of the bottleneck router. We further show how the online failure detector control algorithm can be used to design a controller, analyze the theoretical aspects of the proposed algorithm, and verify its agreement. Simulation results show the efficiency of our scheme in terms of high utilization of the bottleneck link, fast response, and good stability of both the bottleneck router buffer occupancy and the controlled sending rates. We conclude that the new failure detector algorithm provides better QoS.
