Hybrid Message Logging. Combining advantages of Sender-based and Receiver-based Approaches

Abstract With the growing scale of High Performance Computing applications comes an increase in the number of interruptions as a consequence of hardware failures. As the tendency is to scale parallel executions to hundred of thousands of processes, fault tolerance is becoming an important matter. Uncoordinated fault tolerance protocols, such as message logging, seem to be the best option since coordinated protocols might compromise applications scalability. Considering that most of the overhead during failure-free executions is caused by message logging approaches, in this paper we propose a Hybrid Message Logging protocol. It focuses on combining the fast recovery feature of pessimistic receiver-based message logging with the low protection overhead introduced by pessimistic sender-based message logging. The Hybrid Message Logging aims to reduce the overhead introduced by pessimistic receiver-based approaches by allowing applications to continue normally before a received message is properly saved. In order to guarantee that no message is lost, a pessimistic sender-based logging is used to temporarily save messages while the receiver fully saves its received messages. Experiments have shown that we can achieve up to 43% overhead reduction compared to a pessimistic receiver- based logging approach.

[1]  Christine Morin,et al.  Reasons for a pessimistic or optimistic message logging protocol in MPI uncoordinated failure, recovery , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[2]  Emilio Luque,et al.  Tuning SPMD Applications in Order to Increase Performability , 2013, 2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications.

[3]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[4]  Thomas Hérault,et al.  Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).

[5]  D. Manivannan,et al.  HOPE: A Hybrid Optimistic checkpointing and selective Pessimistic mEssage logging protocol for large scale distributed systems , 2012, Future Gener. Comput. Syst..

[6]  Thomas Hérault,et al.  Correlated set coordination in fault tolerant message logging protocols for many‐core clusters , 2013, Concurr. Comput. Pract. Exp..

[7]  Harrick M. Vin,et al.  The cost of recovery in message logging protocols , 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems (Cat. No.98CB36281).

[8]  David B. Johnson,et al.  Sender-Based Message Logging , 1987 .