Verdi: a framework for implementing and formally verifying distributed systems

Distributed systems are difficult to implement correctly because they must handle both concurrency and failures: machines may crash at arbitrary points and networks may reorder, drop, or duplicate packets. Further, their behavior is often too complex to permit exhaustive testing. Bugs in these systems have led to the loss of critical data and unacceptable service outages. We present Verdi, a framework for implementing and formally verifying distributed systems in Coq. Verdi formalizes various network semantics with different faults, and the developer chooses the most appropriate fault model when verifying their implementation. Furthermore, Verdi eases the verification burden by enabling the developer to first verify their system under an idealized fault model, then transfer the resulting correctness guarantees to a more realistic fault model without any additional proof burden. To demonstrate Verdi's utility, we present the first mechanically checked proof of linearizability of the Raft state machine replication algorithm, as well as verified implementations of a primary-backup replication system and a key-value store. These verified systems provide similar performance to unverified equivalents.
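The idea of verifying under an idealized fault model and then carrying the guarantee over to a harsher one can be illustrated with a toy simulation. The sketch below is in Python rather than Coq and proves nothing; it only checks one run. All names (`counter_handler`, `run_ideal`, `run_duplicating`, `with_dedup`) are hypothetical, and the sequence-number wrapper is an assumed mechanism chosen for illustration, not something the abstract specifies.

```python
def counter_handler(state, msg):
    """Hypothetical application handler: add the payload to a running total."""
    return state + msg


def run_ideal(handler, state, msgs):
    """Idealized network semantics: every message is delivered exactly once."""
    for m in msgs:
        state = handler(state, m)
    return state


def run_duplicating(handler, state, msgs):
    """Faulty network semantics (worst case): every message is delivered twice."""
    for m in msgs:
        state = handler(state, m)
        state = handler(state, m)
    return state


def with_dedup(handler):
    """Wrap a handler so that it ignores sequence numbers it has already seen.

    The wrapped state is a pair (seen_sequence_numbers, inner_state) and each
    wrapped message is a pair (sequence_number, payload).
    """
    def wrapped(state, tagged):
        seen, inner = state
        seq, payload = tagged
        if seq in seen:
            return state  # duplicate delivery: drop it
        return (seen | {seq}, handler(inner, payload))
    return wrapped


msgs = [3, 1, 4, 1, 5]

# Behavior the developer reasons about under the idealized model.
ideal = run_ideal(counter_handler, 0, msgs)

# The same handler run directly on the duplicating network double-counts.
raw = run_duplicating(counter_handler, 0, msgs)

# Tagging messages with sequence numbers and wrapping the handler restores
# the idealized behavior on the duplicating network.
tagged = list(enumerate(msgs))
_, wrapped_result = run_duplicating(
    with_dedup(counter_handler), (frozenset(), 0), tagged
)
```

In this run the wrapped handler on the duplicating network ends in the same application state as the unwrapped handler on the idealized network, while the unwrapped handler on the duplicating network does not. A framework such as Verdi would establish that correspondence by proof for all executions instead of by testing one.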
