White-Box Atomic Multicast (Extended Version)

Atomic multicast is a communication primitive that delivers messages to multiple groups of processes according to some total order, with each group receiving the projection of the total order onto the messages addressed to it. To be scalable, atomic multicast needs to be genuine, meaning that only the destination processes of a message should participate in ordering it. In this paper we propose a novel genuine atomic multicast protocol that, in the absence of failures, takes as few as 3 message delays to deliver a message when no other messages are multicast concurrently to its destination groups, and 5 message delays in the presence of concurrency. This improves on the latencies of both the fault-tolerant version of Skeen's classical multicast protocol (6 or 12 message delays, depending on concurrency) and its recent improvement by Coelho et al. (4 or 8 message delays). To achieve such low latencies, we depart from the typical way of guaranteeing fault tolerance, which replicates each group with Paxos. Instead, we weave Paxos and Skeen's protocol together into a single coherent protocol, exploiting opportunities for white-box optimisations. We experimentally demonstrate that the superior theoretical characteristics of our protocol are reflected in practical performance pay-offs.
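
The ordering guarantee described above can be illustrated with a minimal sketch. It is not the protocol proposed in the paper; it only shows what "the projection of the total order onto the messages addressed to a group" means. The message names, group names, and the helper `project` are hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Set


def project(total_order: List[str], dests: Dict[str, Set[str]], group: str) -> List[str]:
    """Return the messages addressed to `group`, kept in total-order sequence."""
    return [m for m in total_order if group in dests[m]]


# Hypothetical scenario: three messages multicast to two groups.
total_order = ["m1", "m2", "m3"]          # some agreed total order
dests = {
    "m1": {"g1"},                         # m1 is addressed to g1 only
    "m2": {"g1", "g2"},                   # m2 is addressed to both groups
    "m3": {"g2"},                         # m3 is addressed to g2 only
}

g1_view = project(total_order, dests, "g1")   # delivery sequence at g1
g2_view = project(total_order, dests, "g2")   # delivery sequence at g2
print(g1_view, g2_view)
```

Here g1 delivers m1 and then m2, while g2 delivers m2 and then m3; each group sees only its own messages, in an order consistent with the single total order. Genuineness is a separate requirement on who participates in computing that order, and this sketch does not model it.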
