Paper information - Marching Band: Fault-Tolerance with Replicable Message Delivery Order

Marching Band ensures the same total order of message deliveries in every possible execution history, which provides replicable execution for a subset of piecewise deterministic applications. With Marching Band, any number of failures can be tolerated using sender-based logging. The main idea behind the algorithm is to log each sent message and then broadcast it together with a precomputed tag that describes the message's place in the delivery order.
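The idea of logging at the sender and delivering by a precomputed tag can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the algorithm from the paper: the abstract does not say how tags are computed or what they contain, so the example assumes each tag is simply an integer slot in the total order, and the names `TaggedMessage`, `Sender`, `Receiver`, and `run` are hypothetical. The receiver holds back any message that arrives ahead of its slot, so two runs with different network arrival orders deliver the same sequence.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaggedMessage:
    tag: int       # ASSUMPTION: tag is an integer slot in the total delivery order
    sender: str
    payload: str


@dataclass
class Sender:
    name: str
    log: list = field(default_factory=list)  # sender-side log, kept for replay

    def send(self, tag, payload):
        msg = TaggedMessage(tag, self.name, payload)
        self.log.append(msg)  # log the message first ...
        return msg            # ... then hand it to the (simulated) broadcast


@dataclass
class Receiver:
    next_tag: int = 0
    pending: dict = field(default_factory=dict)
    delivered: list = field(default_factory=list)

    def receive(self, msg):
        # Buffer the message, then deliver every message whose slot is now due.
        self.pending[msg.tag] = msg
        while self.next_tag in self.pending:
            self.delivered.append(self.pending.pop(self.next_tag))
            self.next_tag += 1


def run(arrival_order):
    """Feed messages to a fresh receiver in the given arrival order."""
    receiver = Receiver()
    for msg in arrival_order:
        receiver.receive(msg)
    return [m.payload for m in receiver.delivered]


a, b = Sender("a"), Sender("b")
m0 = a.send(0, "x = 1")
m1 = b.send(1, "x = x * 2")
m2 = a.send(2, "x = x + 3")

first_run = run([m0, m1, m2])
second_run = run([m2, m0, m1])  # same messages, different arrival order

# Replaying from the senders' logs reproduces the same delivery order.
replayed = run(a.log + b.log)
```

In this toy model, `first_run`, `second_run`, and `replayed` are identical lists, which is the replicable-delivery property the abstract describes. How the real protocol assigns tags, broadcasts, and recovers from failures is left to the paper.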

Arkadiusz Danilecki

[1] D. Manivannan, et al. FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems, 2009, J. Parallel Distributed Comput.

[2] Jeffrey Overbey, et al. A type and effect system for deterministic parallel Java, 2009, OOPSLA '09.

[3] Thomas Hérault, et al. Correlated set coordination in fault tolerant message logging protocols for many‐core clusters, 2013, Concurr. Comput. Pract. Exp.

[4] Robbert van Renesse, et al. Building adaptive systems using ensemble, 1998.

[5] Arkadiusz Danilecki, et al. Forced Replicable Execution for a Subset of Piecewise Deterministic Applications with Deterministic Message Passing, 2014, 2014 15th International Conference on Parallel and Distributed Computing, Applications and Technologies.

[6] Kenneth P. Birman, et al. Exploiting virtual synchrony in distributed systems, 1987, SOSP '87.

[7] Kenneth P. Birman, et al. A review of experiences with reliable multicast, 1999, Softw. Pract. Exp.

[8] Priya Narasimhan, et al. Static Analysis Meets Distributed Fault-Tolerance: Enabling State-Machine Replication with Nondeterminism, 2006, HotDep.

[9] Luis Ceze, et al. DDOS: taming nondeterminism in distributed systems, 2013, ASPLOS '13.

[10] Franck Cappello, et al. On Communication Determinism in Parallel HPC Applications, 2010, 2010 Proceedings of 19th International Conference on Computer Communications and Networks.

[11] D. Manivannan, et al. HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems, 2012, Future Gener. Comput. Syst.

[12] Andrzej Goscinski, et al. A survey and review of the current state of rollback-recovery for cluster systems, 2009.

[13] Idit Keidar, et al. Group communication specifications: a comprehensive study, 2001, CSUR.

[14] Kenneth P. Birman. A review of experiences with reliable multicast, 1999.

[15] Shahram Rahimi, et al. Domino-Effect Free Crash Recovery for Concurrent Failures in Cluster Federation, 2008, GPC.

[16] Sam Toueg, et al. Unreliable failure detectors for reliable distributed systems, 1996, JACM.

[17] Marcos K. Aguilera, et al. Efficient atomic broadcast using deterministic merge, 2000, PODC '00.

[18] Brian Randell, et al. System structure for software fault tolerance, 1975, IEEE Transactions on Software Engineering.

[19] Roy Friedman, et al. Starfish: Fault-Tolerant Dynamic MPI Programs on Clusters of Workstations, 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[20] Luis Ceze, et al. Deterministic Process Groups in dOS, 2010, OSDI.

[21] Hong Ong, et al. VCCP: A transparent, coordinated checkpointing system for virtualization-based cluster computing, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[22] David B. Johnson, et al. Sender-Based Message Logging, 1987.

[23] Franck Cappello, et al. HydEE: Failure Containment without Event Logging for Large Scale Send-Deterministic MPI Applications, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[24] Stephen A. Edwards, et al. A Determinizing Compiler, 2009.

[25] Ion Stoica, et al. ODR: output-deterministic replay for multicore debugging, 2009, SOSP '09.