Challenges in Model Checking of Fault-tolerant Designs in TLA+

Although fault tolerance has historically been associated with safety-critical systems, interest in it has been growing in mainstream applications such as the cloud. Industrial fault-tolerant designs need formal specification and verification because they combine, in non-trivial ways, ideas from distributed algorithms whose correctness usually rests on very subtle mathematical arguments. More and more fault-tolerant designs are formally specified in TLA+. Based on our experience in model checking fault-tolerant distributed algorithms, we propose a research agenda for model checking fault-tolerant designs in TLA+.
