Kernel Paxos

State machine replication is a well-known technique for building fault-tolerant replicated systems. The technique guarantees that all replicas of a service execute the same sequence of deterministic commands in the same total order. At the core of state machine replication is consensus, a distributed problem in which replicas agree on the next command to be executed. Among the various consensus algorithms proposed, Paxos stands out for its optimized resilience and communication. Much effort has gone into implementing Paxos efficiently. Existing solutions make use of special network topologies, rely on specialized hardware, or exploit application semantics. Instead of proposing yet another variation of the original Paxos algorithm, this paper proposes a new strategy to increase the performance of Paxos-based state machine replication. We introduce Kernel Paxos, an implementation of Paxos that significantly reduces communication overhead by avoiding system calls and the TCP/IP stack. To reduce the number of context switches related to system calls, we provide Paxos as a kernel module. We present a detailed performance analysis of Kernel Paxos and compare it to an equivalent user-space implementation.
