Achieving High-Throughput State Machine Replication in Multi-core Systems

The traditional architecture used by implementations of Replicated State Machines (RSMs) does not fully exploit modern multi-core CPUs. This is increasingly the limiting factor in their performance, because network speeds are increasing much faster than the single-thread performance of CPUs. Thus, when deployed on Gigabit-class networks and exposed to a workload of small to medium-sized client requests, RSMs are often CPU-bound: they can leverage only a few cores, even though many more may be available. In this work, we revisit the traditional architecture of an RSM implementation and show how it can be parallelized so that its performance scales with the number of cores in the nodes. We do so by applying several good practices of concurrent programming to the specific case of state machine replication, including staged execution, workload partitioning, actors, and non-blocking data structures. We describe and test a Java prototype of our architecture, based on the Paxos protocol. With a workload consisting of small requests, we achieve a sixfold improvement in throughput using eight cores. More generally, in all our experiments we consistently reach the limits of the network subsystem using up to 12 cores, and we observe no degradation when using up to 24 cores. Furthermore, profiling results for our implementation show that, even at peak throughput, contention between threads is minimal, suggesting that throughput would continue to scale given a faster network.
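To make two of the practices named above more concrete, the following is a minimal sketch of staged execution combined with workload partitioning. It is not the prototype's code: the class `StagedReplicaSketch`, the `Request` type, the `run` method, and the use of a simple counter as the replicated state are all hypothetical and chosen only for illustration. The sketch also uses `LinkedBlockingQueue`, which is lock-based, where a tuned implementation might prefer non-blocking queues, and it omits the Paxos ordering step entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StagedReplicaSketch {
    /** Hypothetical client request; not a type from the paper's prototype. */
    static final class Request {
        final int clientId;
        final long delta;
        Request(int clientId, long delta) { this.clientId = clientId; this.delta = delta; }
    }

    /** Sentinel that tells a stage to shut down. */
    static final Request STOP = new Request(-1, 0);

    /** Runs the two-stage pipeline and returns the final value of the state counter. */
    static long run(int partitions, int clients, int requestsPerClient) throws InterruptedException {
        // Stage 1 inboxes: one queue per partition, so client handling is spread over several threads.
        List<BlockingQueue<Request>> inboxes = new ArrayList<>();
        for (int i = 0; i < partitions; i++) inboxes.add(new LinkedBlockingQueue<>());
        // Single queue feeding stage 2, which applies requests to the state on one thread.
        BlockingQueue<Request> toExecutor = new LinkedBlockingQueue<>();
        long[] state = {0}; // written only by the executor thread, read after join()

        List<Thread> threads = new ArrayList<>();
        for (BlockingQueue<Request> inbox : inboxes) {
            threads.add(new Thread(() -> {
                try {
                    while (true) {
                        Request r = inbox.take();   // a real stage would decode and validate here
                        toExecutor.put(r);          // hand over to the next stage (STOP is forwarded too)
                        if (r == STOP) return;
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        threads.add(new Thread(() -> {
            try {
                int stopped = 0;
                while (stopped < partitions) {      // exit once every partition has shut down
                    Request r = toExecutor.take();
                    if (r == STOP) stopped++; else state[0] += r.delta;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        for (Thread t : threads) t.start();

        // Partition the workload by client id so each client is always served by the same thread.
        for (int c = 0; c < clients; c++)
            for (int k = 0; k < requestsPerClient; k++)
                inboxes.get(c % partitions).put(new Request(c, 1));
        for (BlockingQueue<Request> inbox : inboxes) inbox.put(STOP);
        for (Thread t : threads) t.join();
        return state[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 8, 100));
    }
}
```

Each stage owns its data and communicates only through queues, so the only points where threads interact are the queue operations; this is the property that lets the client-facing stage scale with the number of partitions while the state is still updated by a single thread.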
