Paper information - Decentralized Validation for Non-malicious Arbitrary Fault Tolerance in Paxos

Fault-tolerant distributed systems offer high reliability because they do not exhibit erroneous behavior even when faults occur in their components. Depending on the fault model adopted, hardware and software errors that do not cause a process to crash are usually not tolerated. To tolerate these rather common failures, the usual solution is to adopt a stronger fault model, such as the arbitrary or Byzantine fault model. Algorithms created for this fault model, however, are considerably more complex and require more system resources than those developed for less strict fault models. One approach to reach a middle ground is the non-malicious arbitrary fault model. This model assumes that faults not created with malicious intent can be detected and filtered with a given probability, which allows them to be isolated and mapped to benign faults. In this paper we describe how we extended an implementation of active replication in the non-malicious fault model with a basic type of distributed validation, in which a deviation from the expected algorithm behavior makes a process crash. We experimentally evaluate this implementation using a fault injection framework, showing that it is feasible to extend the concept of non-malicious failures beyond hardware failures.

Gustavo M. D. Vieira | Rodrigo R. Barbieri | Enrique S. dos Santos

[1] Fred B. Schneider, et al. Implementing fault-tolerant services using the state machine approach: a tutorial, 1990, CSUR.

[2] G. M. D. Vieira, et al. Treplica: Ubiquitous Replication, 2007.

[3] Leslie Lamport, et al. Fast Paxos, 2006, Distributed Computing.

[4] Gustavo M. D. Vieira, et al. Hardened Paxos through Consistency Validation, 2015, 2015 Brazilian Symposium on Computing Systems Engineering (SBESC).

[5] Leslie Lamport, et al. The part-time parliament, 1998, TOCS.

[6] G. M. D. Vieira, et al. Implementation of an Object-Oriented Specification for Active Replication Using Consensus, 2010.

[7] Robert Griesemer, et al. Paxos made live: an engineering perspective, 2007, PODC '07.

[8] Miguel Correia, et al. Practical Hardening of Crash-Tolerant Systems, 2012, USENIX Annual Technical Conference.

[9] Leslie Lamport, et al. The Byzantine Generals Problem, 1982, TOPLAS.

[10] Miguel Castro, et al. Practical Byzantine fault tolerance and proactive recovery, 2002, TOCS.

[11] Christof Fetzer, et al. Automatically Tolerating Arbitrary Faults in Non-malicious Settings, 2013, 2013 Sixth Latin-American Symposium on Dependable Computing.