Revisiting the Paxos Foundations: A Look at Summer Internship Work at VMware Research

The summer of 2016 was buzzing with intern activity at the VMware Research Group (VRG), with interns working alongside the entire research team and with David Tennenhouse, Chief Research Officer of VMware. In this paper, we give a brief introduction to Flexible Paxos [4], one of the internship results. There were several other exciting outcomes; internships are a great way to participate in driving innovation at VMware!

Flexible Paxos introduces a surprising observation concerning the foundations of distributed computing. The observation revisits the basic requirements of Paxos [7, 8], Lamport's widely adopted algorithmic foundation for fault tolerance and replication, and a pinnacle of his Turing Award [1]. Since its publication, Paxos has been widely built upon in teaching, research, and production systems.

Paxos implements a fault-tolerant state machine among a group of nodes. At its core, Paxos uses two phases, each of which requires agreement from a subset of nodes (known as a quorum) to proceed. Throughout this manuscript, we refer to the first phase as the leader election phase and to the second as the replication phase. The safety and liveness of Paxos are based on the guarantee that any two quorums will intersect. To satisfy this requirement, quorums are typically composed of any majority from a fixed set of nodes, although other quorum schemes have been proposed.

In practice, we usually wish to reach agreement over a sequence of commands, not just one. This is often referred to as the Multi-Paxos problem [3]. In Multi-Paxos, we use the leader election phase of Paxos to establish one node as the leader for all future commands, until it is replaced by another leader. We use the replication phase of Paxos to agree on a series of commands, one at a time. To commit a command, the leader must always communicate with at least a quorum of nodes and wait for them to accept the value.

In the Flexible Paxos work, we observe that Paxos is conservative: