Impact FD: An Unreliable Failure Detector Based on Process Relevance and Confidence in the System

This paper presents a new unreliable failure detector, called the Impact failure detector (FD), that, contrary to the majority of traditional FDs, outputs a trust level value expressing the degree of confidence in the system. An impact factor is assigned to each process, and the trust level is equal to the sum of the impact factors of the processes not suspected of failure. Moreover, a threshold parameter defines a lower bound for the trust level, above which confidence in the system is ensured. In particular, we define a flexibility property that denotes the capacity of the Impact FD to tolerate a certain margin of failures or false suspicions, i.e., its capacity to consider different sets of responses that lead the system to trusted states. The Impact FD is suitable for systems that present node redundancy, heterogeneity of nodes, and a clustering feature, and that allow a margin of failures which does not degrade the confidence in the system. The paper also includes a timer-based distributed algorithm that implements an Impact FD, as well as its proof of correctness, for systems whose links are lossy asynchronous or in which all (or some) links are eventually timely. Performance evaluation results, based on PlanetLab [1] traces, confirm the flexible applicability of our failure detector and show that, due to the accepted margin of failure, both failures and false suspicions are better tolerated than with traditional unreliable failure detectors.
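
The trust level computation described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names, the process identifiers, the impact values, and the choice of a non-strict comparison (`>=`) against the threshold are all assumptions made for illustration.

```python
from typing import Dict, Set


def trust_level(impact: Dict[str, int], suspected: Set[str]) -> int:
    """Sum the impact factors of the processes that are not suspected."""
    return sum(factor for proc, factor in impact.items() if proc not in suspected)


def is_trusted(impact: Dict[str, int], suspected: Set[str], threshold: int) -> bool:
    """Assumption: the system is trusted when the trust level reaches the threshold."""
    return trust_level(impact, suspected) >= threshold


# Hypothetical system: p1 is more relevant than p2 and p3.
impact = {"p1": 3, "p2": 1, "p3": 1}
threshold = 3

# Two different sets of unsuspected processes both lead to a trusted state.
print(is_trusted(impact, {"p2", "p3"}, threshold))  # only p1 responds
print(is_trusted(impact, {"p2"}, threshold))        # p1 and p3 respond
# Suspecting the high-impact process drops the trust level below the threshold.
print(is_trusted(impact, {"p1"}, threshold))
```

In this sketch, the flexibility property shows up as the fact that several distinct sets of suspected processes keep the trust level at or above the threshold, while suspecting the single high-impact process does not.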

[1]  Péter Urbán,et al.  Definition and specification of accrual failure detectors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[2]  Michel Raynal,et al.  Anonymous asynchronous systems: the case of failure detectors , 2012, Distributed Computing.

[3]  Pierre Sens,et al.  Performance analysis of a hierarchical failure detector , 2003, 2003 International Conference on Dependable Systems and Networks, 2003. Proceedings.

[4]  Francis C. Chu.  Reducing Ω to ◊W , 1998 .

[5]  Marcos K. Aguilera,et al.  On implementing omega with weak reliability and synchrony assumptions , 2003, PODC '03.

[6]  Maurice Herlihy,et al.  Threshold protocols in survivor set systems , 2010, Distributed Computing.

[7]  Marcos K. Aguilera,et al.  Communication-efficient leader election and consensus with limited link synchrony , 2004, PODC '04.

[8]  George Coulouris,et al.  Distributed systems - concepts and design , 1988 .

[9]  Rachid Guerraoui,et al.  Mutual exclusion in asynchronous systems with failure detectors , 2005, J. Parallel Distributed Comput..

[10]  Thomas A. Corbi,et al.  The dawning of the autonomic computing era , 2003, IBM Syst. J..

[11]  Felix C. Freiling,et al.  Secure Failure Detection and Consensus in TrustedPals , 2012, IEEE Transactions on Dependable and Secure Computing.

[12]  Pierre Sens,et al.  RepFD - Using Reputation Systems to Detect Failures in Large Dynamic Networks , 2015, 2015 44th International Conference on Parallel Processing.

[13]  Yuriy Brun,et al.  Smart Redundancy for Distributed Computation , 2011, 2011 31st International Conference on Distributed Computing Systems.

[14]  Flaviu Cristian,et al.  The Timed Asynchronous Distributed System Model , 1999, IEEE Trans. Parallel Distributed Syst..

[15]  Michel Raynal,et al.  On the road to the weakest failure detector for k-set agreement in message-passing systems , 2011, Theor. Comput. Sci..

[16]  Pierre Sens,et al.  Eventually Strong Failure Detector with Unknown Membership , 2012, Comput. J..

[17]  Rachid Guerraoui,et al.  The weakest failure detectors to solve certain fundamental problems in distributed computing , 2004, PODC '04.

[18]  Naohiro Hayashibara,et al.  The φ Accrual Failure Detector , 2004 .

[19]  Mikel Larrea,et al.  On the Implementation of Unreliable Failure Detectors in Partially Synchronous Systems , 2004, IEEE Trans. Computers.

[20]  Raimundo José de Araújo Macêdo,et al.  QoS self-configuring failure detectors for distributed systems , 2010, DAIS'10.

[21]  Yanpei Liu,et al.  An Exponential Smoothing Adaptive Failure Detector in the Dual Model of Heartbeat and Interaction , 2014, J. Comput. Sci. Eng..

[22]  Shashi B. Rana,et al.  Fault Tolerance in Wireless Sensor Network , 2015 .

[23]  Pierre Sens,et al.  A Time-Free Byzantine Failure Detector for Dynamic Networks , 2012, 2012 Ninth European Dependable Computing Conference.

[24]  André Schiper,et al.  Uniform consensus is harder than consensus , 2004, J. Algorithms.

[25]  Paulo Veríssimo,et al.  Using Tailored Failure Suspectors to Support Distributed Cooperative Applications , 1995, Parallel and Distributed Computing and Systems.

[26]  Marcos K. Aguilera,et al.  On the quality of service of failure detectors , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[27]  Mikel Larrea,et al.  Implementing the weakest failure detector for solving the consensus problem , 2013, Int. J. Parallel Emergent Distributed Syst..

[28]  Pierre Sens,et al.  Eventual Leader Election in Evolving Mobile Networks , 2013, OPODIS.

[29]  Sam Toueg,et al.  The weakest failure detector for solving consensus , 1992, PODC '92.

[30]  Pankaj Jalote,et al.  Fault tolerance in distributed systems , 1994 .

[31]  Bernadette Charron-Bost,et al.  On the impossibility of group membership , 1996, PODC '96.

[32]  Marcos K. Aguilera,et al.  Failure detection and consensus in the crash-recovery model , 2000, Distributed Computing.

[33]  Nancy A. Lynch,et al.  Impossibility of distributed consensus with one faulty process , 1985, JACM.

[34]  Sam Toueg,et al.  Unreliable failure detectors for reliable distributed systems , 1996, JACM.

[35]  Noman Islam,et al.  A review of wireless sensors and networks' applications in agriculture , 2014, Comput. Stand. Interfaces.

[36]  Roy Friedman,et al.  Probabilistic Byzantine Tolerance for Cloud Computing , 2015, 2015 IEEE 34th Symposium on Reliable Distributed Systems (SRDS).

[37]  Rajashekhar C. Biradar,et al.  Fault tolerance in wireless sensor network using hand-off and dynamic power adjustment approach , 2013, J. Netw. Comput. Appl..

[38]  Miguel Correia,et al.  From Consensus to Atomic Broadcast: Time-Free Byzantine-Resistant Protocols without Signatures , 2006, Comput. J..