Abstractions for asynchronous distributed computing with malicious players

In modern distributed systems, failures are the norm rather than the exception. In many cases, these failures are not benign. Settings such as the Internet may exhibit both asynchrony and malicious (also called Byzantine or arbitrary) behavior. As a result, and perhaps not surprisingly, research on asynchronous Byzantine fault-tolerant (BFT) distributed systems is flourishing. Tolerating arbitrary behavior and asynchrony calls for very sophisticated algorithms. This is particularly true of BFT solutions that aim to provide properties such as (a) optimal resilience, i.e., tolerating as many Byzantine failures as possible, and (b) optimal performance with respect to some relevant complexity metric. Most BFT algorithms are built from scratch or by modifying existing solutions in a non-modular manner, which often renders these algorithms difficult to understand and, consequently, impedes their wider adoption. We attribute this complexity to the lack of a sufficient number of adequate abstractions for asynchronous BFT distributed computing. The motivation of this thesis is to propose reusable abstractions for devising asynchronous BFT distributed algorithms that are optimally resilient and/or have optimal complexity, with a strong focus on one of the most important complexity metrics: time complexity (or latency). The abstractions proposed in this thesis are devised with three fundamental distributed applications in mind: (a) read/write storage (also called register), (b) consensus and (c) state machine replication (SMR). We demonstrate how to use our abstractions in these applications to devise asynchronous BFT algorithms that feature the best complexity among all algorithms we know of, in addition to optimal resilience. First, we introduce the notion of a refined quorum system (RQS) of some set S as a set of three classes of subsets (quorums) of S: first class quorums are also second class quorums, which are in turn also third class quorums.
First class quorums have large intersections with all other quorums; second class quorums typically have smaller intersections with those of the third class; the latter simply correspond to traditional quorums. The refined quorum system abstraction helps design algorithms that tolerate contention (process concurrency), arbitrarily long periods of asynchrony and the largest possible number of failures, but perform fast if few failures occur, the system is synchronous and there is no contention, i.e., under conditions that are assumed to be frequent in practice. In other words, RQS helps combine optimal resilience and optimal best-case time complexity. Intuitively, under uncontended and synchronous conditions, a distributed object implementation would expedite an operation if a quorum of the first class is accessed, then degrade gracefully depending on whether a quorum of the second or the third class is accessed. Our notion of RQS is devised assuming a general adversary structure, which allows algorithms relying on RQS to relax the assumption of independent process failures. We illustrate the power of refined quorums by introducing two new optimal BFT atomic object implementations: an atomic storage algorithm and a consensus algorithm. Our second abstraction is a novel timestamping mechanism called high resolution timestamps (HRts), which can be seen as a variation of matrix clocks. Roughly speaking, a high resolution timestamp contains a matrix of local timestamps of (a subset of) processes as seen by (a subset of) other processes. Complementary to RQS, HRts simplify the design of BFT distributed algorithms that combine optimal resilience and optimal worst-case time complexity. We apply high resolution timestamps to design read/write storage algorithms in which HRts are used to detect and filter out Byzantine processes, which paves the way to the first BFT storage algorithms that combine optimal resilience with optimal worst-case time complexity.
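To make the "matrix of local timestamps" idea more concrete, the following is a minimal Python sketch of a classic matrix clock, the structure that HRts are said to resemble. It is not the HRts mechanism of the thesis: the class name `MatrixTimestamp`, its methods, and the merge rule are illustrative assumptions, and the sketch assumes honest inputs, so it makes no attempt to detect or filter out Byzantine processes.

```python
from typing import Dict, Iterable


class MatrixTimestamp:
    """Hypothetical matrix clock: rows[p][q] is the latest local
    timestamp of process q that process p is known to have seen."""

    def __init__(self, owner: str, processes: Iterable[str]) -> None:
        procs = list(processes)
        self.owner = owner
        self.rows: Dict[str, Dict[str, int]] = {
            p: {q: 0 for q in procs} for p in procs
        }

    def tick(self) -> None:
        # Advance the owner's own local timestamp.
        self.rows[self.owner][self.owner] += 1

    def merge(self, other: "MatrixTimestamp") -> None:
        # Entry-wise maximum over the whole matrix (assumes honest input).
        for p, row in other.rows.items():
            for q, ts in row.items():
                self.rows[p][q] = max(self.rows[p][q], ts)
        # Whatever the sender had seen, the owner has now seen as well.
        for q, ts in other.rows[other.owner].items():
            self.rows[self.owner][q] = max(self.rows[self.owner][q], ts)


# Usage: process "a" performs a local step, then "b" learns about it.
a = MatrixTimestamp("a", ["a", "b"])
b = MatrixTimestamp("b", ["a", "b"])
a.tick()
b.merge(a)
```

After the merge, `b` records both that `a` reached local timestamp 1 and that `b` itself has seen it. A Byzantine-tolerant design would need to validate received matrices before merging them, which this sketch deliberately leaves out.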
Finally, we introduce ABsTRACT (Abortable Byzantine faulT-toleRant stAte maChine replicaTion), a generic abstraction that simplifies the notoriously difficult task of developing BFT state machine replication algorithms. ABsTRACT resembles BFT-SMR and can be used to make any shared service Byzantine fault-tolerant, with one exception: it may sometimes abort a client request. The non-triviality condition under which ABsTRACT cannot abort is a generic parameter. We view a BFT-SMR algorithm as a composition of instances of ABsTRACT, each instance developed and analyzed independently. To illustrate our approach, we describe two new optimally resilient BFT algorithms. The first, which makes use of our refined quorums, has the lowest time complexity among all BFT-SMR algorithms we know of in synchronous periods that are free from contention and failures. The second algorithm has the highest peak throughput in failure-free and synchronous periods; this algorithm argues for the general applicability of ABsTRACT in developing BFT shared services that feature optimal complexity, beyond the time complexity metric.
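The nesting of quorum classes described in the abstract, and the idea of expediting an operation when a first class quorum responds, can be pictured with a small Python sketch. Everything concrete here is an assumption made for illustration: the class `NestedQuorumClasses`, the five-server set, and the size-based choice of quorums (all 5, any 4, any 3) are hypothetical, and the sketch does not check the intersection properties that an actual refined quorum system must satisfy.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Optional, Set


class NestedQuorumClasses:
    """Hypothetical container for three nested classes of quorums:
    class 1 is contained in class 2, which is contained in class 3."""

    def __init__(
        self,
        class1: Iterable[Iterable[str]],
        class2: Iterable[Iterable[str]],
        class3: Iterable[Iterable[str]],
    ) -> None:
        self.class1: Set[FrozenSet[str]] = {frozenset(q) for q in class1}
        # Enforce the nesting: every class 1 quorum is also class 2, etc.
        self.class2 = {frozenset(q) for q in class2} | self.class1
        self.class3 = {frozenset(q) for q in class3} | self.class2

    def best_class(self, responders: Iterable[str]) -> Optional[int]:
        """Return the strongest class (1 is best) for which the set of
        responders contains a quorum, or None if it contains none."""
        got = frozenset(responders)
        for level, quorums in ((1, self.class1), (2, self.class2), (3, self.class3)):
            if any(q <= got for q in quorums):
                return level
        return None


# Illustrative instance over five servers (thresholds are assumptions).
servers = ["s1", "s2", "s3", "s4", "s5"]
rqs = NestedQuorumClasses(
    class1=[servers],
    class2=combinations(servers, 4),
    class3=combinations(servers, 3),
)
```

With this instance, a reply set containing all five servers is classified as class 1, any four servers as class 2, any three as class 3, and fewer than three yields `None`. An implementation could use the returned class to decide how quickly an operation may complete, degrading gracefully as fewer servers respond.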
