IronFleet: proving practical distributed systems correct

Distributed systems are notorious for harboring subtle bugs. Verification can, in principle, eliminate these bugs a priori, but verification has historically been difficult to apply at full-program scale, much less distributed-system scale. We describe a methodology for building practical and provably correct distributed systems based on a unique blend of TLA-style state-machine refinement and Hoare-logic verification. We demonstrate the methodology on a complex implementation of a Paxos-based replicated state machine library and a lease-based sharded key-value store. We prove that each obeys a concise safety specification, as well as desirable liveness requirements. Each implementation achieves performance competitive with a reference system. With our methodology and lessons learned, we aim to raise the standard for distributed systems from "tested" to "correct."
