Understanding, evaluating, and improving the availability of BFT protocols

We live in an increasingly digital world where many critical services such as healthcare, financial services, and transportation are accessed over the Internet. These services must run correctly and continuously despite faults; deviations can lead to the loss of money, sensitive information, and even human life. Most commercial systems tolerate only benign faults, where a faulty component stops participating in the system as soon as it fails. However, these services also experience arbitrary (Byzantine) faults, such as software bugs, hardware faults, misconfigurations, and malicious attacks. Byzantine Fault-Tolerance (BFT) is a promising technique for tolerating such a wide range of faults because it makes no assumptions about the nature of the fault. However, successfully deploying BFT techniques in general-purpose distributed systems requires addressing several challenges. First, BFT protocols are complex, and their performance characteristics under realistic operating conditions are not well understood. Moreover, implementations of these protocols have been developed independently using different operating systems, languages, and workloads, which makes a fair comparison among them difficult. We present BFTSim, a simulation tool that simplifies the construction of BFT protocols and enables a comprehensive evaluation of these protocols under a wide range of realistic operating conditions. Second, many e-commerce services favor high availability over consistency. In these services, it is acceptable for a read operation to return a potentially stale value in order to remain responsive. However, existing BFT techniques cannot be directly applied in these systems because they favor strong consistency guarantees over availability. For example, a read either returns the result of the latest write operation or blocks. To bridge this gap, we present Zeno, a novel BFT protocol that provides a weak consistency guarantee to achieve higher availability.
The contributions made by this thesis are important steps towards enabling the widespread adoption of BFT techniques in general-purpose distributed systems.