Experimental evaluation of the QoS of failure detectors on wide area network

This paper describes an experiment performed on a wide area network to assess and fairly compare the quality of service (QoS) provided by a large family of failure detectors. Failure detectors are a popular middleware mechanism for improving the dependability of distributed systems and applications, and their QoS greatly influences the QoS that upper layers can provide. It is thus of the utmost importance to equip a system with an appropriate failure detector and to tune its parameters properly so that it provides the most desirable QoS. The paper first analyzes the QoS indicators and the structure of push-style failure detectors, then introduces the choices of estimators and safety margins used to build 30 failure detectors. It next describes the experimental setup designed and implemented to allow a fair comparison of the QoS of the alternatives in a real, representative experimental setting. Finally, it presents the results obtained from the experiments and their interpretation.
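
As a rough illustration of the push-style structure mentioned above, the following minimal sketch shows a monitor that combines a delay estimator with a safety margin to decide when to suspect the monitored process. It is not one of the 30 detectors evaluated in the paper: the class name `PushFailureDetector`, the windowed-mean estimator, the constant safety margin, the assumption that heartbeat k is sent at time k × period, and the assumption of synchronized clocks are all illustrative choices.

```python
from collections import deque


class PushFailureDetector:
    """Monitor side of a push-style (heartbeat) failure detector (sketch)."""

    def __init__(self, period, safety_margin, window=10):
        self.period = period                # heartbeat sending interval, seconds
        self.safety_margin = safety_margin  # constant margin (illustrative assumption)
        self.delays = deque(maxlen=window)  # most recent observed transmission delays
        self.next_seq = 0                   # sequence number of the next expected heartbeat

    def on_heartbeat(self, seq, send_time, recv_time):
        """Record the delay of heartbeat `seq` and advance the expected number."""
        # Assumes sender and monitor clocks are synchronized.
        self.delays.append(recv_time - send_time)
        self.next_seq = max(self.next_seq, seq + 1)

    def estimated_delay(self):
        """Windowed-mean estimator; returns 0.0 before any heartbeat is seen."""
        return sum(self.delays) / len(self.delays) if self.delays else 0.0

    def deadline(self):
        """Time after which the next expected heartbeat counts as missing."""
        # Assumes heartbeat k is sent at time k * period.
        send_time = self.next_seq * self.period
        return send_time + self.estimated_delay() + self.safety_margin

    def suspects(self, now):
        """True if the next heartbeat is overdue at time `now`."""
        return now > self.deadline()
```

In this sketch, swapping `estimated_delay` for another predictor, or replacing the constant margin with one that adapts to delay variability, yields a different detector with the same overall structure, which is the kind of variation the paper compares.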
