Scaling Replicated State Machines with Compartmentalization [Technical Report]

State machine replication protocols, like MultiPaxos and Raft, are a critical component of many distributed systems and databases. However, these protocols offer relatively low throughput due to several bottlenecked components. Numerous existing protocols fix different bottlenecks in isolation but fall short of a complete solution. When you fix one bottleneck, another arises. In this paper, we introduce compartmentalization, the first comprehensive technique to eliminate state machine replication bottlenecks. Compartmentalization involves decoupling individual bottlenecks into distinct components and scaling these components independently. Compartmentalization has two key strengths. First, compartmentalization leads to strong performance. In this paper, we demonstrate how to compartmentalize MultiPaxos to increase its throughput by 6× on a write-only workload and 16× on a mixed read-write workload. Unlike other approaches, we achieve this performance without the need for specialized hardware. Second, compartmentalization is a technique, not a protocol. Industry practitioners can apply compartmentalization to their protocols incrementally without having to adopt a completely new protocol.

PVLDB Reference Format:
Michael Whittaker, Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Neil Giridharan, Joseph M. Hellerstein, Heidi Howard, Ion Stoica, and Adriana Szekeres. Scaling Replicated State Machines with Compartmentalization. PVLDB, 14(1): XXX-XXX, 2020. doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at http://vldb.org/pvldb/format_vol14.html.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:XX.XX/XXX.XX
