Towards Modeling and Model Checking Fault-Tolerant Distributed Algorithms

Fault-tolerant distributed algorithms are central to building reliable, spatially distributed systems. For these algorithms to actually make systems more reliable, they must themselves be correct. Unfortunately, model checking state-of-the-art fault-tolerant distributed algorithms (such as Paxos) is currently out of reach except for very small systems.
