Certifying Safety when Implementing Consensus

Ensuring the correctness of distributed system implementations remains a challenging and largely unaddressed problem. In this paper, we present a protocol that can be used to certify the safety of consensus implementations. The protocol is efficient in both the number and the size of the additional messages it sends, and it is designed to operate correctly even when $n-1$ of the nodes in an $n$-node distributed system fail (assuming fail-stop failures). We also comment on how our construction might be generalized to certify other protocols and invariants.
