Scalable Causal Message Logging for Wide-Area Environments

Causal message logging spreads recovery information around the network in which the processes execute. This is an attractive property for wide-area networks: it can be used to replicate processes that are otherwise inaccessible due to network partitions. However, current causal message logging protocols do not scale to thousands of processes. We describe the Hierarchical Causal Logging Protocol (HCML), which is scalable. It uses a hierarchy of proxies to reduce the amount of information a process needs to maintain. Proxies also act as caches for recovery information and reduce the overall message overhead by as much as 50%. HCML also leverages differences in bandwidth between processes, which reduces overall message latency by as much as 97%.
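
To make the first claim concrete, the following is a minimal toy sketch of how causal message logging spreads recovery information. It is not HCML and not the authors' implementation: it assumes the usual textbook formulation, in which each delivery is recorded as a determinant and every outgoing message piggybacks the determinants its sender knows. The names `Determinant`, `Message`, `Process`, and `holders` are hypothetical, and the sketch omits proxies, stability tracking, and garbage collection.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Records one delivery event: which message was delivered, and in what order."""
    sender: str
    send_seq: int
    receiver: str
    recv_seq: int


@dataclass
class Message:
    sender: str
    send_seq: int
    payload: str
    piggyback: frozenset  # determinants known to the sender at send time


@dataclass
class Process:
    name: str
    send_seq: int = 0
    recv_seq: int = 0
    known: set = field(default_factory=set)  # determinants held in volatile memory

    def send(self, payload: str) -> Message:
        # Piggyback every determinant this process currently knows (toy assumption:
        # nothing is ever pruned).
        self.send_seq += 1
        return Message(self.name, self.send_seq, payload, frozenset(self.known))

    def deliver(self, msg: Message) -> None:
        # Merge the piggybacked determinants, then record this delivery itself.
        self.recv_seq += 1
        self.known |= msg.piggyback
        self.known.add(Determinant(msg.sender, msg.send_seq, self.name, self.recv_seq))


def holders(processes, det: Determinant):
    """Return the names of the processes that currently hold a given determinant."""
    return sorted(p.name for p in processes if det in p.known)


if __name__ == "__main__":
    p, q, r = Process("p"), Process("q"), Process("r")
    q.deliver(p.send("a"))  # q records the determinant of its own delivery
    r.deliver(q.send("b"))  # q's determinant travels to r on the next message
    print(holders([p, q, r], Determinant("p", 1, "q", 1)))
```

In this toy run the determinant of q's delivery ends up at both q and r, so it survives a failure of q. The sketch also shows why the flat scheme is costly: each process accumulates determinants about every process it causally depends on, which is the per-process state the abstract says the proxy hierarchy is meant to reduce.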
