Experiences with Fault-Injection in a Byzantine Fault-Tolerant Protocol

The overall performance improvement in Byzantine fault-tolerant state machine replication algorithms has made them a viable option for critical high-performance systems. However, the construction of the proofs necessary to support these algorithms are complex and often make assumptions that may or may not be true in a particular implementation. Furthermore, the transition from theory to practice is difficult and can lead to the introduction of subtle bugs that may break the assumptions that support these algorithms. To address these issues we have developed Hermes, a fault-injector framework that provides an infrastructure for injecting faults in a Byzantine fault-tolerant state machine. Our main goal with Hermes is to help practitioners in the complex process of debugging their implementations of these algorithms, and at the same time increase the confidence of possible adopters, e.g., systems researchers, industry, by allowing them to test the implementations. In this paper, we discuss our experiences with Hermes to inject faults in BFT-SMaRt, a high-performance Byzantine fault-tolerant state machine replication library.

[1]  Liuba Shrira,et al.  HQ replication: a hybrid quorum protocol for byzantine fault tolerance , 2006, OSDI '06.

[2]  John Lane,et al.  Prime: Byzantine Replication under Attack , 2011, IEEE Transactions on Dependable and Secure Computing.

[3]  Wolfgang Schröder-Preikschat,et al.  AspectC++: an aspect-oriented extension to the C++ programming language , 2002 .

[4]  Jean-Philippe Martin,et al.  Fast Byzantine Consensus , 2006, IEEE Trans. Dependable Secur. Comput..

[5]  Michael K. Reiter,et al.  Fault-scalable Byzantine fault-tolerant services , 2005, SOSP '05.

[6]  Jacob A. Abraham,et al.  FERRARI: A Flexible Software-Based Fault and Error Injection System , 1995, IEEE Trans. Computers.

[7]  Ramakrishna Kotla,et al.  Zyzzyva: speculative byzantine fault tolerance , 2007, TOCS.

[8]  Gregor Kiczales,et al.  Aspect-oriented programming , 2001, ESEC/FSE-9.

[9]  Ravishankar K. Iyer,et al.  Measuring Fault Tolerance with the FTAPE Fault Injection Tool , 1995, MMB.

[10]  Kang G. Shin,et al.  DOCTOR: an integrated software fault injection environment for distributed real-time systems , 1995, Proceedings of 1995 IEEE International Computer Performance and Dependability Symposium.

[11]  Michael Dahlin,et al.  Making Byzantine Fault Tolerant Systems Tolerate Byzantine Faults , 2009, NSDI.

[12]  Jean-Charles Fabre,et al.  Failure analysis of an ORB in presence of faults , 2001 .

[13]  Miguel Castro,et al.  Practical byzantine fault tolerance and proactive recovery , 2002, TOCS.

[14]  Falko Bause,et al.  Quantitative Evaluation of Computing and Communication Systems , 1995, Lecture Notes in Computer Science.

[15]  William H. Sanders,et al.  Loki: a state-driven fault injector for distributed systems , 2000, Proceeding International Conference on Dependable Systems and Networks. DSN 2000.

[16]  Jie Xu,et al.  Assessing the dependability of OGSA middleware by fault injection , 2003, 22nd International Symposium on Reliable Distributed Systems, 2003. Proceedings..

[17]  Alysson Neves Bessani,et al.  From Byzantine Consensus to BFT State Machine Replication: A Latency-Optimal Transformation , 2012, 2012 Ninth European Dependable Computing Conference.

[18]  John Lane,et al.  Customizable Fault Tolerance forWide-Area Replication , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).

[19]  Satoshi Matsuoka,et al.  ECOOP'97 — Object-Oriented Programming , 1997, Lecture Notes in Computer Science.

[20]  Gregor Kiczales,et al.  Aspect-oriented programming , 1996, CSUR.

[21]  Jonathan Kirsch,et al.  Scaling Byzantine Fault-Tolerant Replication toWide Area Networks , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[22]  Andrea Bondavalli,et al.  Adaptare: Supporting automatic and dependable adaptation in dynamic environments , 2012, TAAS.

[23]  Arun Venkataramani,et al.  Separating agreement from execution for byzantine fault tolerant services , 2003, SOSP '03.

[24]  John Lane,et al.  Steward: Scaling Byzantine Fault-Tolerant Replication to Wide Area Networks , 2010, IEEE Transactions on Dependable and Secure Computing.

[25]  Nick McKeown,et al.  OpenFlow: enabling innovation in campus networks , 2008, CCRV.

[26]  Farnam Jahanian,et al.  Testing of fault-tolerant and real-time distributed systems via protocol fault injection , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[27]  David Mazières,et al.  Beyond One-Third Faulty Replicas in Byzantine Fault Tolerant Systems , 2007, NSDI.

[28]  Flaviu Cristian,et al.  Centralized failure injection for distributed, fault-tolerant protocol testing , 1997, Proceedings of 17th International Conference on Distributed Computing Systems.