Series in Informatics Partitioned Paxos via the Network Data Plane

Consensus protocols are the foundation for building fault-tolerant distributed systems and services. They are also widely acknowledged as performance bottlenecks. Several recent systems have proposed accelerating these protocols using the network data plane. But while network-accelerated consensus shows great promise, current systems suffer from an important limitation: they assume that the network hardware also accelerates the application itself. Consequently, they provide a specialized replicated service rather than general-purpose, high-performance consensus that fits any off-the-shelf application. To address this problem, this paper proposes Partitioned Paxos, a novel approach to network-accelerated consensus. The key insight behind Partitioned Paxos is to separate the two aspects of Paxos, agreement and execution, and optimize them separately. First, Partitioned Paxos uses the network forwarding plane to accelerate agreement. Then, it uses state partitioning and parallelization to accelerate execution at the replicas. Our experiments show that, using this combination of data plane acceleration and parallelization, Partitioned Paxos provides at least a 3× latency improvement and an 11× throughput improvement for a replicated instance of a RocksDB key-value store.
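The execution-side idea of state partitioning can be illustrated with a minimal sketch. It is not the paper's implementation: the hash-based routing, the in-memory dictionaries standing in for RocksDB, and every name below (`partition_of`, `split_by_partition`, `execute_partition`, `execute_replica`, `NUM_PARTITIONS`) are assumptions made for illustration. The sketch assumes that agreement has already produced an ordered sequence of decided commands and that commands on different keys do not depend on each other.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_of(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a deterministic hash (an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_by_partition(decided, num_partitions: int = NUM_PARTITIONS):
    """Split the decided command sequence into per-partition logs.

    The relative order of commands within a partition is preserved.
    """
    logs = [[] for _ in range(num_partitions)]
    for op, key, value in decided:
        logs[partition_of(key, num_partitions)].append((op, key, value))
    return logs


def execute_partition(log):
    """Apply one partition's commands in order to its own private state."""
    store = {}  # stands in for the partition's slice of the key-value store
    for op, key, value in log:
        if op == "put":
            store[key] = value
        elif op == "delete":
            store.pop(key, None)
    return store


def execute_replica(decided, num_partitions: int = NUM_PARTITIONS):
    """Execute all partitions concurrently; return one store per partition."""
    logs = split_by_partition(decided, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(execute_partition, logs))
```

Because every command for a given key lands in the same partition and each partition applies its log in order, two replicas that receive the same decided sequence reach the same per-partition state, regardless of how the worker threads interleave. In CPython the threads mainly illustrate the structure; real parallel speedup would need processes or a runtime without a global interpreter lock.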
