Marching Band: Fault-Tolerance with Replicable Message Delivery Order

Marching Band ensures the same total order of message deliveries in every possible execution history, providing replicable execution for a subset of piecewise deterministic applications. With Marching Band, any number of failures can be tolerated using sender-based logging. The main idea behind the algorithm is to log and then broadcast each sent message, together with a precomputed tag that describes where the message falls in the delivery order.
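The log-then-broadcast idea can be illustrated with a minimal sketch. It is not the Marching Band algorithm itself: the abstract does not say how tags are computed, so the tag used here, a pair (per-sender send counter, sender id), is purely an assumption, as are the names `Tagged` and `Process`. The sketch also leaves out how a receiver decides that no earlier-tagged message is still in flight.

```python
from dataclasses import dataclass, field
import heapq


@dataclass(order=True, frozen=True)
class Tagged:
    # Hypothetical tag: (sender's send counter, sender id). Only the tag
    # takes part in comparisons, so it alone fixes the delivery order.
    tag: tuple
    payload: str = field(compare=False)


class Process:
    def __init__(self, pid):
        self.pid = pid
        self.sent = 0          # number of messages sent so far
        self.send_log = []     # sender-based log of outgoing messages
        self.pending = []      # min-heap of received, undelivered messages
        self.delivered = []    # payloads in delivery order

    def send(self, payload, group):
        # Compute the tag before sending, log the message, then broadcast.
        msg = Tagged((self.sent, self.pid), payload)
        self.sent += 1
        self.send_log.append(msg)
        for proc in group:
            proc.receive(msg)
        return msg

    def receive(self, msg):
        heapq.heappush(self.pending, msg)

    def deliver_all(self):
        # Deliver by tag, not by arrival order. A real protocol must first
        # establish that no smaller tag can still arrive; omitted here.
        while self.pending:
            self.delivered.append(heapq.heappop(self.pending).payload)
        return self.delivered
```

In this toy setting, a process that receives the same logged messages in a different arrival order, for example while they are replayed from the senders' logs after a failure, still delivers them in the same order, because the order depends only on the tags.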
