The Overhead of Safe Broadcast Persistency

Several papers that assume a recoverable failure model have stated the need to log messages in secondary storage once they have been received. However, none of them has analysed the overhead that this logging implies when reliable broadcasts are used in a group communication system that guarantees virtual synchrony. At first glance, the cost seems excessive for its apparently limited advantages, but several scenarios contradict this intuition. This paper surveys some of these configurations and outlines some benefits of this persistence-related approach.
